Reinforcement Learning for the People and/or by the People
Emma Brunskill
Stanford University
NIPS 2017 Tutorial
Amazing Reinforcement Learning Progress
Overview
● RL introduction
● RL for people
● RL by the people
Audience
• If you are:
– Interested in a quick overview of RL (section 1)
– Want to learn about the RL technical challenges involved in people-facing applications (section 2)
– Want to learn about how people can help RL systems (section 3)
*Caveats
• Not trying to cover all the domain-specific methods for tackling these questions
• Focus will be on RL setting, and new technical challenges for RL with people and using people
• Always delighted to learn about new areas or key references I might have missed; email me at [email protected]
Reinforcement Learning for People
(Diagram: agent gives a math exercise to a student and observes the student's answer)

Reinforcement Learning for People
(Diagram: agent gives a math exercise, observes the student's answer, and receives a reward when the student passes the exam)

Reinforcement Learning for People
(Diagram: agent suggests a product to a customer, observes whether the customer buys the product, and receives revenue as reward)
Policy: Prior Recommendations & Purchases → Product Ad
Goal: Choose actions to maximize expected revenue

Reinforcement Learning
(Diagram: agent takes an action, receives an observation and a reward)
Policy: Map Observations → Actions
Goal: Choose actions to maximize expected rewards
Overview
● RL introduction● RL for people● RL by the people
RL Basics
● MDP
● POMDP
● Planning
● 3 views
○ Model
○ Model-free
○ Policy search
● Exploration vs exploitation: how do we get data?
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
• Policy π: s → a
• Optimal policy is one with the highest expected discounted sum of future rewards
Partially Observable MDP
• Set of states S
• Set of actions A
• Set of observations Z
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Observation model P(z|s′,a)
• Policy π: history (z,a,z′,a′,...) → a
• Optimal policy is one with the highest expected discounted sum of future rewards
Model-Based Policy Evaluation
• Given an MDP and a policy π: s → a
• How good is this policy?
• Want to compute the expected sum of discounted rewards if we follow it (starting from some initial states)
• Could do Monte Carlo rollouts and average
• But if the domain is Markov, we can do better...
Model-Based Policy Evaluation with Dynamic Programming
• Given an MDP and a policy π: s → a
• Set V(s) = 0 for all s
• Iteratively update the value:
• V(s) ← R(s) + γ Σs′ T(s, π(s), s′) V(s′)
(first term: immediate reward; second term: discounted sum of future rewards)
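The update above can be sketched in a few lines of Python. This is a minimal illustration, not from the tutorial: the 2-state chain MDP and its numbers are made up, and the transition matrix is the one induced by the fixed policy.

```python
import numpy as np

def evaluate_policy(T_pi, R, gamma, n_iters=500):
    """Iterate V(s) <- R(s) + gamma * sum_s' T_pi[s, s'] * V(s'),
    where T_pi[s, s'] is the transition matrix induced by the fixed policy."""
    V = np.zeros(len(R))
    for _ in range(n_iters):
        V = R + gamma * T_pi @ V
    return V

# Toy chain: state 0 stays with prob 0.9, moves to state 1 with prob 0.1;
# state 1 is absorbing. Reward 1 in state 0, reward 0 in state 1.
T_pi = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
R = np.array([1.0, 0.0])
V = evaluate_policy(T_pi, R, gamma=0.9)
```

Because the update is a contraction, repeated application converges to the unique fixed point; here V(0) approaches 1/(1 − 0.9·0.9).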
Planning & Bellman Equation
• Given an MDP (includes dynamics and reward model)
• Compute the optimal policy
• Bellman optimality equation:
– V(s) = R(s) + γ Σs′ T(s, π*(s), s′) V(s′)
– Q(s,a) = R(s) + γ Σs′ T(s,a,s′) V(s′)
– π*(s) = argmaxa [R(s) + γ Σs′ T(s,a,s′) V(s′)] = argmaxa Q(s,a)
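The Bellman equations above can be turned into value iteration directly. A minimal sketch on a hypothetical 2-state, 2-action MDP (all numbers are illustrative, not from the tutorial):

```python
import numpy as np

def value_iteration(T, R, gamma, n_iters=500):
    """T[a, s, s2] = P(s2 | s, a); R[s] is the per-state reward.
    Returns Q[s, a] and the greedy policy pi*(s) = argmax_a Q(s, a)."""
    V = np.zeros(T.shape[1])
    for _ in range(n_iters):
        Q = R[None, :] + gamma * (T @ V)   # Q[a, s] = R(s) + gamma * sum_s' T V
        V = Q.max(axis=0)                  # Bellman backup: V(s) = max_a Q(s, a)
    return Q.T, Q.argmax(axis=0)

# Toy MDP: action 0 stays put, action 1 jumps to state 1; state 1 is
# absorbing and pays reward 1, state 0 pays reward 0.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1
R = np.array([0.0, 1.0])
Q_sa, pi = value_iteration(T, R, gamma=0.9)
```

With γ = 0.9 the absorbing state is worth 1/(1 − γ) = 10, so the greedy policy in state 0 is to jump (action 1), worth γ·10 = 9.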
Reinforcement Learning
• Typically still assume an MDP
• Know the set of states S
• Know the set of actions A
• Maybe a discount factor γ or horizon H
• Don't know the dynamics and/or reward model!
• Still want to take actions to yield high reward
RL: 3 Common Approaches
Image from David Silver
Model-Based RL
• Typically still assume an MDP
• As we interact in the world, observe states, actions and rewards
• Use machine learning to estimate a model of the dynamics and/or rewards from this experience
• Now have (one or more) estimated MDP models
• Now can do planning to compute an optimal policy
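The estimation step above is simple in the tabular case: maximum-likelihood counts over observed (s, a, r, s′) transitions. A minimal sketch with made-up data (a real agent would then plan in the estimated model):

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Tabular MLE of T(s, a, s') and R(s, a) from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
    n_sa = counts.sum(axis=2)                          # visits to each (s, a)
    T_hat = counts / np.maximum(n_sa, 1)[:, :, None]   # empirical T(s, a, s')
    R_hat = reward_sums / np.maximum(n_sa, 1)          # empirical R(s, a)
    return T_hat, R_hat

# Illustrative experience: (s, a, r, s') tuples.
data = [(0, 0, 1.0, 0), (0, 0, 1.0, 0), (0, 0, 0.0, 1), (1, 0, 0.0, 1)]
T_hat, R_hat = estimate_model(data, n_states=2, n_actions=1)
```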
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate temporal difference error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Difference between what was observed and the current estimate of long-term expected reward (Q)
– Uses the Markov property and bootstraps (didn't observe reward until the end of the episode, so use Q(st+1,a′) as a proxy)
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate TD-error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Use it to update the estimate of Q(st,at): Q(st,at) ← Q(st,at) + α δ(st,at)
• Slowly moves estimate toward observed experience
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate TD-error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Use it to update the estimate of Q(st,at): Q(st,at) ← Q(st,at) + α δ(st,at)
• Computationally cheap
• But only propagates experience one step
– Replay can help fix this
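The tabular Q-learning update above is a one-liner in code. A minimal sketch; the 2-state environment and the single transition below are made up for illustration:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a).
    Move Q(s, a) a step of size alpha toward the observed target."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return Q

Q = np.zeros((2, 2))
# One observed transition: from state 0, action 1 yields reward 1, lands in state 1.
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

After this single update, only Q(0, 1) has moved, by α·δ = 0.1·1 = 0.1; the one-step nature of the propagation is exactly the limitation the slide notes.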
Policy Search
• Directly search the policy space for argmaxπ Vπ
• Parameterize the policy and do stochastic gradient descent
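A minimal sketch of the parameterize-and-do-SGD idea: a softmax policy over two arms trained with the score-function (REINFORCE) gradient. The bandit, its reward means, and the learning rate are all illustrative, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # policy parameters: one logit per arm
true_means = np.array([0.2, 0.8])    # hypothetical reward means; arm 1 is better

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()   # softmax policy pi(a; theta)
    a = rng.choice(2, p=p)
    r = rng.normal(true_means[a], 0.1)
    grad_log_pi = -p
    grad_log_pi[a] += 1.0            # d/dtheta log pi(a) = e_a - p
    theta += 0.05 * r * grad_log_pi  # stochastic gradient ascent step

p = np.exp(theta) / np.exp(theta).sum()
```

After training, the policy concentrates on the better arm; as the slide notes, this is only a local ascent, though the toy problem here has no bad local optima.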
RL: 3 Common Approaches
Image from David Silver
Exploration vs Exploitation
● In online reinforcement learning, learn about world through acting
● Trade off between
○ Learning more about how the world works
○ Using that knowledge to maximize reward
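The simplest instance of this trade-off is epsilon-greedy action selection: with probability ε take a random action (learn more), otherwise take the current best guess (use that knowledge). The Q-values below are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, rng):
    """Pick an action for one state: explore with prob epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.integers(len(Q_row))   # explore: uniform random action
    return int(np.argmax(Q_row))          # exploit: current greedy action

rng = np.random.default_rng(0)
Q_row = np.array([0.1, 0.5, 0.3])         # made-up Q-values for one state
actions = [epsilon_greedy(Q_row, 0.1, rng) for _ in range(1000)]
```

With ε = 0.1 the greedy action is chosen about 93% of the time, but every action keeps being tried occasionally, so the estimates can still improve.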
Overview
● RL introduction
● RL for people
● RL by the people
Reinforcement Learning Progress
(Game and simulation successes: cheap to try things, or simulate) ≠ (people-facing applications: high stakes, hard to model)
Technical Challenges in RL for People
1. Sample-efficient learning
2. What do we want to optimize?
3. Batch (purely offline) reinforcement learning
Sample Efficiency Through Transfer
(Spectrum: tasks all different ↔ tasks all the same)
Assume Finite Set of Models
(Diagram: MDP Y with TY, RY; MDP R with TR, RR; MDP G with TG, RG)
Sample a task from a finite set of MDPs (shared S & A space)
Assume Finite Set of Groups: Multi-Task Reinforcement Learning
(Diagram: MDP Y with TY, RY; MDP R with TR, RR; MDP G with TG, RG; ...)
Series of tasks; act in each task for H steps
Multi-Task Reinforcement Learning
(Diagram: a series of tasks drawn from MDPs Y, R, G)
Multi-Task Reinforcement Learning
(Diagram: a series of tasks drawn from MDPs Y, R, G, but now with T = ? and R = ? unknown)
Multi-Task Reinforcement Learning
• Captures a number of settings of interest
• Our primary contributions have been showing that this can provably speed learning (Brunskill and Li UAI 2013; Brunskill and Li ICML 2014; Guo and Brunskill AAAI 2015)
• Limitations: focused on discrete state and action spaces, impractical bounds, optimizing for average performance
Assume Related Meta-Groups
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
• Encouraging empirically
• No guarantees
• Scalability?
Hidden Parameter MDP
● Allow for smooth linear parameterization of dynamics model
Doshi-Velez, F., & Konidaris, G. (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI 2016 (p. 1432).
Hidden Parameter MDP++
● Use Bayesian neural nets for dynamics
● Benefits for an HIV treatment simulation
● Each episode is a new patient
TW Killian, G Konidaris, F Doshi-Velez. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.
Sample Efficient Online RL: Policy Search
• Policy search is a hugely popular technique in RL
– Deep Reinforcement Learning through Policy Optimization, Schulman and Abbeel NIPS 2016 tutorial: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf
• Not always associated with data efficiency
• But can be used to tune a small set of parameters efficiently
Thomas and Brunskill, ICML 2016
Policy Search as Function Optimization
Figure from Wilson et al. JMLR 2014
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ)
Policy Search as Function Optimization: Gradient Approaches
• Use structure of sequential decision making
• Only find local optima
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ, from Wilson et al. JMLR 2014)
Policy Search as Bayesian Optimization: Treats Each Evaluation as Costly
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ)
• Explicitly represents uncertainty, finds global optima
• Generally ignores sequential structure
Figure from Wilson et al. JMLR 2014
Policy Search as Bayesian Optimization
● Data efficient policy search
● Can leverage Markovian structure (e.g. Wilson et al. 2014)
● Doesn’t have to make assumptions about world model
● Can combine with off policy evaluation to further speed up learning (in terms of amount of data required)
Goel, Dann and Brunskill IJCAI 2017
Policy Search for Optimal Stopping Problems
● Can leverage full trajectories to retrospectively evaluate alternate optimal stopping policies
● Substantial increase in data efficiency over generic bounds (Ng and Jordan UAI 2000)
● But Gaussian Processes struggle in high dimensions
Goel, Dann and Brunskill IJCAI 2017
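The retrospective-evaluation idea above can be sketched concretely: because a stopping decision does not change the underlying trajectory, one logged trajectory lets us score every threshold policy exactly. The "offer sequence" framing and all numbers below are illustrative, not from the paper.

```python
import numpy as np

def stopping_value(offers, threshold):
    """Reward of the policy 'stop at the first offer >= threshold'
    (take the last offer if none qualifies)."""
    for x in offers:
        if x >= threshold:
            return x
    return offers[-1]

# Made-up logged trajectories of offers, fully observed to the horizon.
trajectories = [[0.2, 0.7, 0.4], [0.9, 0.1, 0.3], [0.3, 0.5, 0.6]]
thresholds = [0.0, 0.5, 0.9]
avg_value = {t: np.mean([stopping_value(traj, t) for traj in trajectories])
             for t in thresholds}
best_threshold = max(avg_value, key=avg_value.get)
```

Every candidate policy is evaluated on every trajectory, so no importance weighting is needed; this is the source of the data-efficiency gain over generic off-policy bounds.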
Assuming Finite Set of Policies to Speed Learning in a New Task
● Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems (pp. 720-727). ACM.
● Talvitie, E., & Singh, S. P. (2007, January). An Experts Algorithm for Transfer Learning. In IJCAI (pp. 1065-1070).
● Azar, M. G., Lazaric, A., & Brunskill, E. (2013, September). Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 97-112).
● Focus on the discrete state and action space setting
Multi-Task Policy Learning
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In IJCAI.
Multi-Task Policy Learning
Goel, Dann and Brunskill IJCAI 2017
● Set of policies with a shared basis set of parameters
● Can be used to do cross-domain transfer (different states & actions)
Multi-Task Policy Learning with Shared Policy and Feature Parameters
● Descriptor features and policy features
● Primarily tackling continuous control tasks
● Scaling to enormous state spaces and performance for a single trajectory less clear
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Deep RL Transfer
● Deep reinforcement learning to find good shared representation (Finn, Abbeel, Levine ICML 2017)
● Fast transfer by encouraging shared representation learning across tasks
Direct Policy Search
• Most of these are trying to speed online learning in a new task
• Not making guarantees on performance for a single task
(Some) Recent Sample Efficient RL References
Brunskill, E., & Li, L. Sample Complexity of Multi-task Reinforcement Learning. UAI 2013.
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
Doshi-Velez, F., & Konidaris, G. (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI 2016 (p. 1432).
Killian, T. W., Konidaris, G., & Doshi-Velez, F. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.
Goel, Dann, & Brunskill. Sample Efficient Policy Search for Optimal Stopping Domains. IJCAI 2017.
Guo, Zhaohan, and Emma Brunskill. "Concurrent PAC RL." AAAI 2015.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2017). Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ICML.
Silver, D., Newnham, L., Barker, D., Weller, S., & McFall, J. (2013, February). Concurrent reinforcement learning from customer interactions. In International Conference on Machine Learning (pp. 924-932)
Carlos Diuk, Lihong Li, and Bethany R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML, 2009.
Kirill Dyagilev, Shie Mannor, and Nahum Shimkin. Efficient reinforcement learning in parameterized models: Discrete parameter case. In European Workshop on Reinforcement Learning, 2008
Liu, Y., Guo, Z., & Brunskill, E. PAC Continuous State Online Multitask Reinforcement Learning with Identification. AAMAS 2016.
Osband, I., Russo, D., & Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. NIPS
Osband, I., & Van Roy, B. (2014). Near-optimal Regret Bounds for Reinforcement Learning in Factored MDPs. NIPS.
Zhou, L., & Brunskill, E. Latent Contextual Bandits and Their Application to Personalized Recommendations for New Users. IJCAI 2016.
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015, July). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In International Joint Conference on Artificial Intelligence (pp. 3345-3351)
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Finn, C., Abbeel, P., & Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017.
Beyond Expectation
• If interacting with people, sometimes we may care about more than expected performance averaged across many rounds
• Work on risk-sensitive policies includes
– García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
– Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In ICML (pp. 935-942).
– Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
– Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In NIPS (pp. 252-260).
– Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. In NIPS (pp. 3509-3517).
– Delage, E., & Mannor, S. (2007). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In ICML (pp. 225-232).
• Work on safe exploration includes
– Geramifard, A. (2012). Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes (Doctoral dissertation, Massachusetts Institute of Technology).
– Moldovan, T. M., & Abbeel, P. (2012). Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.481
– Turchetta, M., Berkenkamp, F., & Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS (pp. 4312-4320).
– Gillula, J. H., & Tomlin, C. J. (2012). Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In ICRA (pp. 2723-2730).
• Many make assumptions on structural regularities of the world model
(Typical RL successes: limited prior data) ≠ (people-facing applications: often lots of prior data!)
Batch (Entirely Offline) RL
A Classrooms Avg Score: 95
B Classrooms Avg Score: 92
What should we do for a new student?
Comes Up in Many Domains: e.g. Equipment Maintenance Scheduling
Comes Up in Many Domains: e.g. Patient Treatment Ordering
Challenge: Counterfactual Reasoning
(Given A Classrooms Avg Score 95 and B Classrooms Avg Score 92: what score would classrooms have gotten under the other condition?)

Challenge: Generalization to Untried Policies
(Given A Classrooms Avg Score 95 and B Classrooms Avg Score 92: how well would a policy we never tried perform?)
Counterfactual Estimation
● Active area of research in many fields
● Today focus on this in the context of reinforcement learning (sequential decision making)
● A few pointers to some other work and settings:
○ Saria and Schulam. ML Foundations and Methods for Precision Medicine and Healthcare. NIPS 2016 Tutorial.
○ T. Joachims, A. Swaminathan, Tutorial on Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement, ACM Conference on Research and Development in Information Retrieval (SIGIR), 2016.
○ Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (to appear)
○ Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3), 399-424.
○ T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as Treatments: Debiasing Learning and Evaluation, International Conference on Machine Learning (ICML), 2016.
○ Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.
Batch Data Policy Evaluation
Data on decisions & outcomes

Batch Data Policy Evaluation
Data on decisions & outcomes → candidate policies: Policy 1, Policy 2, Policy 3, ..., Policy N

Batch Data Policy Evaluation
Data on decisions & outcomes → for each Policy i, an Estimated Performance i (policy evaluation)

Batch Data Policy Selection
Data on decisions & outcomes → estimated performance for each policy (policy evaluation) → choose a Policy i (policy selection)
Batch RL for People
• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
– Valid confidence intervals
– Consistency
• May want a measure of confidence in values: confidence intervals
• Measure of confidence in the policy
• Empirically good performance
• Robustness to assumptions in the estimator
• Lots of great work on off-policy policy learning
– Here assume we only get access to a fixed batch of data (no more learning) and care about accuracy of the result
• ...
Policy: Player state → level
Goal: Maximize engagement
Old data: ~11,000 students
Build Predictive Model
Data on decisions & outcomes → predictive statistical model of player behavior

Compute Policy that Optimizes Expected Rewards for this Model
(Plan against the predictive statistical model of player behavior)

Use Models as a Simulator
(Simulate action / observation / reward interactions with the predictive model)
Problem: Model May Not be Accurate… Yields Poor Estimate of Policy Performance
Worse: More Accurate Models Can Yield Even Poorer Performing Policies?
Mandel, Liu, B and Popovic AAMAS 2014
Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model → compute best policy for model
● Much prior work: if the model is good, the policy is good
● Challenge: the model class may be wrong
○ May model as Markov when the domain is not Markov
○ If we compute a policy assuming the model is Markov, the resulting value estimate is not valid
Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model → compute best policy for model
● Much prior work: if the model is good, the policy is good
● Challenge: the model class may be wrong
● How can we identify if the model type is wrong?
○ Sometimes feasible, sometimes hard
● Relates to
○ Sim2Real problem
○ Adversarial examples
Prior Work: Estimate Model from Data
Data on decisions & outcomes → predictive statistical model → compute best policy for model
• Strengths
– Low variance estimator of policy performance
• Weaknesses
– Not an unbiased estimator (the model may be poor!)
– Not a consistent estimator
Build & Estimate a Model
• Define states and actions
• Dynamics p(s’|s,a)
• Observation model p(z|s,a)
• Rewards r(s,a,z)
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Key Challenge: Distribution Mismatch
• Rewards r(s,a,z)
• Policy maps history (a,z,r,a′,z′,…) → a
Behavior policy → histories → E[Σi ri]
New policy → histories? → E[Σi ri]?
Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Estimate of behavior policy πb performance → estimate of evaluation policy πe performance
Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
• Strengths
– Unbiased estimator of policy performance
– Strongly consistent estimator of policy performance
• Weaknesses
– High variance estimator*
*Many extensions, including Retrace (NIPS 2016), but most are focused on the online off-policy setting
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Estimated evaluation policy πe performance
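A minimal per-trajectory importance-sampling sketch: reweight each logged return by the product of evaluation-to-behavior action-probability ratios along the trajectory. The two-action data and both policies below are made up for illustration.

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b):
    """Each trajectory is (actions, total_return); pi_e / pi_b map
    action -> probability (state-independent here for simplicity)."""
    vals = []
    for actions, ret in trajectories:
        w = np.prod([pi_e[a] / pi_b[a] for a in actions])  # importance weight
        vals.append(w * ret)
    return np.mean(vals)

pi_b = {0: 0.5, 1: 0.5}   # uniform behavior policy that generated the data
pi_e = {0: 0.9, 1: 0.1}   # evaluation policy we never actually ran
data = [([0], 1.0), ([1], 0.0), ([0], 1.0), ([1], 0.2)]
v_hat = is_estimate(data, pi_e, pi_b)
```

The estimate is unbiased, but the weights (1.8 and 0.2 here) already hint at the variance problem: with longer trajectories the product of ratios can explode.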
We used off-policy evaluation to find a policy with 30% higher
engagement (Mandel et al. AAMAS 2014)
Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide from Thomas)
High Confidence Off-Policy Policy Evaluation
High Confidence Off-Policy Policy Evaluation
Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide modified from Thomas)
High Confidence Off-Policy Policy Evaluation
Thomas, Theocharous, & Ghavamzadeh AAAI 2015
High Confidence Off-Policy Policy Evaluation
~2 million trajectories
Thomas, Theocharous, & Ghavamzadeh AAAI 2015
High Confidence Off-Policy Policy Improvement
• Use approximate confidence intervals
• Policy evaluation interleaved with running policy for high confidence iterative improvement
• Domain horizon (10) still quite small
Thomas, Theocharous, & Ghavamzadeh ICML 2015
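The high-confidence idea can be sketched as "deploy only if a lower confidence bound on the new policy beats the current one." Thomas et al. use tighter concentration bounds; this sketch uses a crude one-sided normal approximation, and the per-trajectory importance-sampled returns and baseline value are made up.

```python
import numpy as np

def lower_bound(is_returns, z=1.645):
    """~95% one-sided lower bound on the mean, normal approximation."""
    x = np.asarray(is_returns, dtype=float)
    return x.mean() - z * x.std(ddof=1) / np.sqrt(len(x))

# Hypothetical per-trajectory IS estimates for the candidate policy.
is_returns = [0.9, 1.1, 0.8, 1.2, 1.0, 0.7, 1.3, 1.0]
v_current = 0.5                       # performance of the deployed policy
deploy = lower_bound(is_returns) > v_current
```

Only when the pessimistic bound clears the incumbent's value do we switch, which is what makes the improvement step "high confidence" rather than best-guess.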
Two Extremes of Offline Reinforcement Learning
ML model + plan: + data efficient, − biased
Importance sampling: − data intensive, + unbiased
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
Doubly Robust Estimation
ML model + plan (+ data efficient, − biased) combined with importance sampling (− data intensive, + unbiased)
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(The estimator combines the reward actually received with a model of the reward)
Doubly Robust Estimation for RL
• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: the derived estimator is unbiased, and an unbiased estimator need not have the lowest mean squared error
(The DR estimator combines importance weights, the actual rewards in the dataset, and model-based estimates of Q and V)
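The per-trajectory DR recursion can be sketched as follows: start from the model's value estimate and correct it with importance-weighted TD-like terms, working backward through the trajectory (in the spirit of Jiang and Li's estimator). The model values, importance ratios, and trajectory below are all made up for illustration.

```python
def dr_trajectory(steps, V_hat, Q_hat, gamma=1.0):
    """steps: list of (s, a, r, rho), where rho = pi_e(a|s) / pi_b(a|s).
    V_hat[s] and Q_hat[(s, a)] are model-based value estimates."""
    v = 0.0
    for s, a, r, rho in reversed(steps):
        # Model baseline plus an importance-weighted correction term.
        v = V_hat[s] + rho * (r + gamma * v - Q_hat[(s, a)])
    return v

V_hat = {0: 1.0, 1: 0.5}                     # hypothetical model values
Q_hat = {(0, 0): 1.2, (1, 0): 0.5}
traj = [(0, 0, 1.0, 1.5), (1, 0, 0.4, 1.0)]  # (s, a, r, rho) per step
v_dr = dr_trajectory(traj, V_hat, Q_hat)
```

If the model is accurate the correction terms are small (low variance); if the importance ratios are accurate the corrections remove the model's bias, which is the "doubly robust" property.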
Tight Estimate Often Better than Unbiased: Measure with Mean Squared Error
Thomas and Brunskill, ICML 2016
• Trade off bias and variance
(Importance sampling estimator: unbiased but high variance; model-based estimator: low variance but biased)
Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error
(Blend the 1-step, 2-step, …, N-step estimates with weights x1, x2, …, xN to trade bias against variance)
Thomas and Brunskill, ICML 2016
Model and Guided Importance Sampling combining (MAGIC) Estimator
Estimated policy value using a particular weighting of the model estimate and the importance sampling estimate
Thomas and Brunskill, ICML 2016
• Solve quadratic program
• Strongly consistent*
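A heavily simplified one-dimensional version of this idea (not the paper's quadratic program over step-wise estimates): blend a model estimate with the IS estimate using the weight that minimizes a plug-in MSE estimate. The model value and IS returns below are illustrative.

```python
import numpy as np

def blend(model_est, is_returns):
    """Weight the (biased, low-variance) model estimate against the
    (unbiased, high-variance) IS mean to minimize estimated MSE."""
    x = np.asarray(is_returns, dtype=float)
    is_mean = x.mean()
    var = x.var(ddof=1) / len(x)            # variance of the IS mean
    bias_sq = (model_est - is_mean) ** 2    # crude plug-in bias estimate
    lam = var / (var + bias_sq)             # weight on the model estimate
    return lam * model_est + (1 - lam) * is_mean

v_hat = blend(model_est=0.8, is_returns=[0.0, 2.1, 0.3, 1.8, 0.1, 1.9])
```

When the IS estimate is noisy the blend leans on the model; as data accumulates, the variance term shrinks and the blend approaches the unbiased IS estimate, which is the intuition behind strong consistency.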
MAGIC Can Yield Orders of Magnitude Better Estimates of Policy Performance
(Figure, log scale: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators)
Thomas and Brunskill, ICML 2016
MAGICal Policy Search
Thomas and Brunskill, EWRL 2016
Variance of Importance Sampling Can Grow Exponentially with Time Horizon
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation
● Importance sampling based estimator
● Dropping weights (covariance) reduces variance at the cost of bias
● Strongly consistent
Guo, Thomas and Brunskill, NIPS 2017
(Split the importance weights: weights for the first part of the trajectory, weights for the second part of the trajectory)
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation
Promising for some forms of evaluation in Atari
But subtle: depends on particular form of rewards and policies
*Note: updated since original tutorial given a bug found in original plots
Guo, Thomas and Brunskill, NIPS workshop 2017
High Confidence Bounds for Weighted Doubly Robust Estimation
● High confidence bounds for weighted doubly robust off-policy policy estimation (Hanna, Stone, Niekum AAMAS 2017)
Fairness of Importance Sampling for Policy Selection
• Though IS policy estimates are unbiased, policy selection using them can be unfair
• Define unfair as: selecting the worse policy as the higher-performing one more than 50% of the time
• With finite data, selection can be biased towards certain policies
– e.g. more myopic policies / shorter trajectories
(Figure: estimated values for Policy 1 and Policy 2)
Doroudi, Thomas and Brunskill, UAI 2017, Best Paper
Re-thinking Models: Robust Matrix Evaluation
• When data is very limited and the horizon is long
• IS estimators still have too high variance
• Train multiple models & simulate policies on each
• Can use for minimax policy selection
Doroudi, Aleven and Brunskill, Learning at Scale, 2017
Model \ Policy Policy 1 Policy 2
Model 1 8 1
Model 2 10 27
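The minimax selection over the table above is a two-line computation: score each policy by its worst-case value across the trained models, then pick the policy with the best worst case (the values are taken from the table; the code is just a sketch of the selection rule).

```python
import numpy as np

# Rows: models, columns: policies (values from the table above).
values = np.array([[8.0, 1.0],
                   [10.0, 27.0]])
worst_case = values.min(axis=0)       # worst model for each policy
chosen = int(np.argmax(worst_case))   # minimax choice (0 = Policy 1)
```

Here Policy 2 looks much better under Model 2, but its worst case (1) is far below Policy 1's worst case (8), so the minimax rule prefers Policy 1.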
Re-thinking Models
• Which models to include?
• What are good models with limited data?
• Bayesian Neural Networks seem promising
– Applications to stochastic dynamical systems (e.g. Depeweg, Hernández-Lobato, Doshi-Velez, & Udluft, ICLR 2017)
Overview
● RL introduction
● RL for people
● RL by the people
RL By the People
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Human Specifies Reward
● Is the true reward the best reward for the agent to learn from? E.g.
○ Singh, S., Lewis, R. L., & Barto, A. G. Where do rewards come from. CogSci 2009.
● Given a (computationally) bounded agent, it may not be
● A human may not write down the reward they really want
● May want constraints on behavior (Thomas, Castro da Silva, Barto, Brunskill arXiv 2017)
Inverse Reward Design (Hadfield-Menell et al., NIPS 2017)
● Reward is an observation of the true desired reward function
● Compute a Bayesian posterior over the true reward
● Compute a risk-averse policy
● Avoids new bad things
● Also avoids new good things
Image credit: http://www.infocuriosity.com/king-midas-and-his-golden-touch-an-ancient-greek-mythology-for-children/
Human Specifies Reward
● May be hard for people to specify a reward
● Can use it as a partial specification / shaping
● Do RL on top of the specified reward
Human Provides Demonstrations
● Imitation / IRL / learning from demonstration / apprenticeship learning
● Enormously influential, especially in robotics
● Recent tutorial at ICRA: http://lasa.epfl.ch/tutorialICRA16/
● Key idea: a human provides demonstrations and the agent uses these to learn the task
● Here assume we get access to an initial set of trajectories & then have no more interactions with the person
● Goal is to learn a good policy
Behavioral Cloning
● Key challenge: data is not iid
Figure from Ross, S., & Bagnell, J. A. (2012). Agnostic system identification for model-based reinforcement learning. ICML.
Human Provides Demonstrations: Inverse Reinforcement Learning
● Learn a reward function from human demonstrations
● Develop a policy given that reward function
○ Often involves assuming access to a dynamics model or the ability to try new policies online
○ IRL is ill-specified: a reward of 0 everywhere is always sufficient
○ Try to match state features
○ The max-entropy approach has been very influential
○ Meta-learning
○ Learning from different demonstrations
○ Doing RL on top can be useful
○ A number of recent efforts to combine with deep learning
Human Provides Demonstrations: Guided Cost Learning
Finn, C., Levine, S., & Abbeel, P. (2016, June). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (pp. 49-58).
Human Provides Demonstrations: Generative Adversarial Imitation Learning
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (pp. 4565-4573).
(Figure: number of trajectories on the x-axis, performance on the y-axis)
Some Benefits & Limitations to Learning from Demonstrations
● Requires access to experts performing the task; demonstrations alone can be limited, so they are often combined with online RL, which is not always feasible
● In some cases it is expensive/slow to gather data (teach a student across a year? provide customer recommendations across months?)
● May not help if the solution we need is radically far from the demonstrations; we still have to handle the exploration/exploitation trade-off online afterwards
● Still a question of how to capture expertise (e.g., which state features to use)
● But generally very promising
○ Demonstrations can be prior data collected for other purposes
○ Great for automating, scaling up, & fine-tuning good solutions
Humans Providing Online Rewards
● Sophie’s Kitchen
● Human trainer can award a scalar reward signal r ∈ [−1, 1]
Thomaz, A. L., & Breazeal, C. (2008). Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence Journal, 172:716-737.
Humans Providing Online Rewards
W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2010. (Best student paper)
● TAMER framework
● Uses human reward feedback to help shape the agent’s reward
● Can be used in addition to other reward signals from the domain
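A minimal sketch of the combination idea (a toy chain MDP with assumed names, not Knox & Stone's actual code): a learned human-feedback term H(s, a), scaled by β, is simply added to the environment reward inside an ordinary Q-learning update.

```python
import random

N_STATES, ACTIONS = 5, (-1, 1)                 # chain 0..4; move left/right
H = {(s, 1): 1.0 for s in range(N_STATES)}     # trainer likes moving right

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 10.0 if s2 == N_STATES - 1 else 0.0    # environment goal at right end
    return s2, r

def train(beta, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    rng = random.Random(0)
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = (rng.choice(ACTIONS) if rng.random() < eps
                 else max(ACTIONS, key=lambda x: Q[(s, x)]))
            s2, r = step(s, a)
            # TAMER-style combination: environment reward + scaled human term.
            shaped = r + beta * H.get((s, a), 0.0)
            Q[(s, a)] += alpha * (shaped
                                  + gamma * max(Q[(s2, b)] for b in ACTIONS)
                                  - Q[(s, a)])
            s = s2
    return Q

Q = train(beta=1.0)
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
```

Here the human term mostly speeds up learning; with β = 0 the agent would still eventually find the goal from the environment reward alone.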
Human Provides Advice / Labels / Feedback
● Human will continue to provide advice about policies over time
Ross, S., Gordon, G. J., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (pp. 627-635).
Human Provides Advice
Griffith, S., Subramanian, K., Scholz, J., Isbell, C., & Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems (pp. 2625-2633).
● Advise: a Bayesian approach for policy shaping
● Agent may get a “right”/“wrong” label after performing an action
● C: feedback is consistent with the optimal policy with probability 0 < C < 1
● Update a posterior over the policy from the human feedback
● Combine the policy learned from the agent’s own experience with the policy estimated from the human
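The Bayesian update above can be sketched directly (assumed names; a sketch of the Advise-style form, where Δ is the count of "right" minus "wrong" labels for an action and C is the feedback-consistency probability):

```python
def prob_optimal(n_right, n_wrong, C=0.8):
    """P(action is optimal | labels): C^Delta / (C^Delta + (1-C)^Delta)."""
    delta = n_right - n_wrong
    return C**delta / (C**delta + (1.0 - C)**delta)

def shaped_policy(pi_agent, feedback, C=0.8):
    """Combine the agent's own action distribution with the human estimate
    by multiplying the two and renormalizing (sketch)."""
    combined = {a: pi_agent[a] * prob_optimal(*feedback[a], C)
                for a in pi_agent}
    z = sum(combined.values())
    return {a: p / z for a, p in combined.items()}

# A few "right" labels quickly concentrate the policy on that action:
pi = {"left": 0.5, "right": 0.5}
fb = {"left": (0, 2), "right": (3, 0)}   # (n_right, n_wrong) per action
shaped = shaped_policy(pi, fb)
```

Note the behavior at Δ = 0: the human term is uninformative (probability 0.5), so the agent falls back on its own experience.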
Discussion: Human Provides Advice / Labels / Feedback
● Performance gains are often significant
● Challenge: it is expensive to have someone in the loop, and often not feasible (the number of advice examples can run into the hundreds)
● In some cases (a household robot, asking a supervisor for help with a tricky case) it could be realistic, and easier than an expert providing demonstrations
● There might be some interesting merged cases:
○ Learning from demonstrators who can finish the agent’s demonstration once it is stuck?
● Could also potentially be good for unknown unknowns
○ A human may help the agent realize there is a problem/limitation even when the agent would not have
Humans Teaching Agents
● How do people try to teach?
● Evidence that (at least for animal training) people may expect their reward feedback to be interpreted as an action label
○ This differs from a learner trying to maximize its long-term reward
○ Ho, M. K., Littman, M. L., Cushman, F., & Austerweil, J. L. (2015). Teaching with rewards and punishments: Reinforcement or communication? In CogSci.
How do People Teach
● People can perform tasks differently when trying to show the task than when doing it at a very high level of performance
○ Ho, M. K., Littman, M., MacGlashan, J., Cushman, F., & Austerweil, J. L. (2016). Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems (pp. 3027-3035)
How do People Teach
● Whether human trainers give positive or negative feedback for a decision is influenced by the learner’s current policy, and this policy-dependence changes which learning algorithm is best
○ MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., & Littman, M. L. (2017). Interactive Learning from Policy-Dependent Human Feedback. In Proceedings of the 34th International Conference on Machine Learning.
Cooperative Inverse RL
Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (NIPS 2016).
● Game where human and the agent get rewards determined by the same reward function.
● If the learner has different capabilities from teacher, teacher should teach in a way to help learning agent learn its optimal policy (which may be different than if the teacher performed the task herself)
How Should We Teach Agents
● Machine teaching: the optimal way to teach an agent given knowledge of the agent’s learning algorithm
○ Zhu, X. (2015). Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education. In AAAI (pp. 4083-4087).
● How to teach a sequential decision-making algorithm, e.g.:
○ Cakmak, M., & Lopes, M. (2012). Algorithmic and Human Teaching of Sequential Decision Tasks. In AAAI.
○ Walsh, T. J., & Goschin, S. (2012). Dynamic Teaching in Sequential Decision Making Environments. In UAI.
○ Peng, B., MacGlashan, J., Loftin, R., Littman, M. L., Roberts, D. L., & Taylor, M. E. (2017). Curriculum Design for Machine Learners in Sequential Decision Tasks. In AAMAS.
Overview
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Histogram Tutor
At the end, a post-test
Action Correct/Wrong
Continually Improving Tutoring System
Action
Improving Across Many Students
Action Action
Over Time, the Tutoring System Stopped Giving Some Problems to Students
System Self-Diagnosed that Problems Weren’t Helping Student Learning
Rules of the World Are Not Fixed
≠
Humans are Invention Machines
New actions New sensors
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Human in the Loop Reinforcement Learning
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Human in the Loop Reinforcement Learning
Add Action
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Add Action
Direct Human Effort for Adding New Actions
Mandel, Liu, Brunskill & Popovic, AAAI 2017
Expected Local Improvement
ELI(s) = Pr(human gives you action a_h for state s) × (improvement in value/outcomes at state s if action a_h is added)
Mandel, Liu, Brunskill & Popovic, AAAI 2017
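A rough sketch of that quantity (assumed names and array shapes, not the paper's exact estimator): for each state, weight each candidate action's value gain by the probability the human would actually supply it, then query the human at the state with the largest expected gain.

```python
import numpy as np

def expected_local_improvement(p_human, q_new, v_current):
    """p_human[s, a]: est. prob. the human supplies action a at state s.
    q_new[s, a]: est. value of state s if action a were added.
    v_current[s]: value of s under the current action set."""
    # Gain from an added action is how much it raises the state's value;
    # never negative, since the old actions are kept.
    gains = np.maximum(q_new - v_current[:, None], 0.0)
    return (p_human * gains).sum(axis=1)   # expectation over human's answer

# Two states, two candidate actions (illustrative numbers):
p = np.array([[0.7, 0.3],
              [0.5, 0.5]])
q = np.array([[1.0, 4.0],
              [2.0, 2.0]])
v = np.array([2.0, 2.0])
eli = expected_local_improvement(p, q, v)
best_state = int(np.argmax(eli))           # direct human effort here
```

The point of the weighting is that a state where a huge improvement exists but the human is unlikely to volunteer the right action may be a worse place to spend human effort than a state with a modest but likely gain.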
Mostly Bad Human Input
Mandel, Liu, Brunskill & Popovic, AAAI 2017
Ask people to add new hints where they might help
Reward
Action
Observation
Add Feature
Is there a Latent Feature that if System Knew It, Could Make Better Decisions?
Ongoing work with Ramtin Keramati
Ongoing work with Ramtin Keramati
Ask for Feature if Exists Latent Variable Model of a State That Could Change Policy
Ongoing work with Ramtin Keramati
Ask for Feature if Exists Latent Variable Model of a State That Could Change Policy
Ongoing work with Ramtin Keramati
Learn Model That Supports Making Good Decisions, Not Perfect Model
Reward
Action
Observation
Requests
Directing Human Experts to Change Actions & Observation Space
Ongoing work with Ramtin Keramati
Reward
Action
Observation
Requests
Another Way to Have Humans Teach Agents
Ongoing work with Ramtin Keramati
Ask people to add new hints where they might help
But system isn’t improving much… what’s going on?
Humans Teaching Agents: Questions
● How do we teach humans to teach agents?
○ People may not have a good model of (a) the agent/learner, or (b) access to content (actions) that can change the agent’s/learner’s state
○ May need guidance about how best to help the agent/learner
○ Just like people have to learn to be effective human teachers…
○ In our group, a new project on teaching the teachers...
● Lots of open directions here
● Involving teachers is still expensive relative to leveraging prior trajectories of demonstrations or instructions
● Can we make better algorithms that leverage past teaching demonstrations (YouTube has many lecture videos) rather than expert demonstrations? COACH (on the prior slide; MacGlashan et al.) is one step in this direction
RL with help from People
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Overview
● RL introduction
● RL for people
● RL by the people
● Lots of applications that could benefit from RL
● Lots of ways people can help make RL systems better
● Interested in discussing a postdoc opportunity? Email me at [email protected]
Thanks to
and Karan Goel, Travis Mandel, Yun-En Liu, Ramtin Keramati, NSF, ONR, Microsoft, Google, Yahoo & IES