A Unifying Framework for Computational Reinforcement Learning Theory
Lihong Li
Rutgers Laboratory for Real-Life Reinforcement Learning (RL3)
Department of Computer Science, Rutgers University
PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy
Joint work with Michael Littman, Alex Strehl, Tom Walsh, …
$ponsored $earch
Are these better alternatives? Need to EXPLORE!
Thesis
The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way
for creating and analyzing RL algorithms with provably efficient exploration.
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Reinforcement Learning Example

AT&T Dialer [Li & Williams & Balakrishnan 09]
[Figure: a spoken-dialog system where the user wants to call someone at AT&T. The user's utterance ("May I speak to John Smith?") is mapped by speech recognition, NLP, belief tracking, etc. into a 100K-dimensional dialog state (features). A dialog manager, optimized by RL, maps states to actions such as Confirm("John Smith"); language generation and text-to-speech produce responses to the user ("So you want to call John Smith, is that right?"). Reward: -1 per response, +20 if the conversation succeeds, -20 if it fails. Dialog design objective: succeed in the conversation with the fewest responses.]
RL Summary
"Define reward and let the agent chase it!"
(European Workshop on Reinforcement Learning 2008)
Markov Decision Process
• Environment is often modeled as an MDP: $M = \langle S, A, T, R, \gamma \rangle$
  – $S$: set of states
  – $A$: set of actions
  – $T$: transition probabilities
  – $R$: reward function
  – $\gamma$: discount factor in $(0,1)$
• Dynamics over time $s_1, s_2, \ldots, s_t, s_{t+1}, \ldots$: at step $t$, the agent in state $s_t$ takes action $a_t$, receives reward $r_t = R(s_t, a_t)$, and moves to $s_{t+1} \sim T(\cdot \mid s_t, a_t)$
• Regularity assumption: $0 \le R(s, a) \le 1$
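To make the notation concrete, here is a minimal sketch (not from the thesis) of a finite MDP as a plain data structure; the class name FiniteMDP and its fields are illustrative, with T stored as per-(s, a) multinomials and R as a table.

```python
import random

class FiniteMDP:
    """Minimal finite MDP M = <S, A, T, R, gamma> with tabular T and R."""
    def __init__(self, n_states, n_actions, T, R, gamma):
        self.n_states = n_states      # S = {0, ..., n_states - 1}
        self.n_actions = n_actions    # A = {0, ..., n_actions - 1}
        self.T = T                    # T[s][a] = list of next-state probabilities
        self.R = R                    # R[s][a] in [0, 1]
        self.gamma = gamma            # discount factor in (0, 1)

    def step(self, s, a):
        """Sample r_t = R(s_t, a_t) and s_{t+1} ~ T(.|s_t, a_t)."""
        r = self.R[s][a]
        s_next = random.choices(range(self.n_states), weights=self.T[s][a])[0]
        return r, s_next
```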
Policies and Value Functions
• Policy: $\pi : S \to A$
• Value function: $Q^{\pi}(s, a) = \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \right]$
• Optimal value function: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
• Optimal policy: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$
• Solving an MDP: find $Q \approx Q^{*}$, then act greedily: $\pi(s) = \arg\max_{a} Q(s, a) \approx \pi^{*}(s)$
Solving an MDP
• Planning (when $T$ and $R$ are known)
  – Dynamic programming, linear programming, …
  – Relatively easy to analyze
• Learning (when $T$ or $R$ are unknown)
  – Q-learning [Watkins 89], …
  – Fundamentally harder
  – Exploration/exploitation dilemma
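For the planning case (T and R known), the following sketch of value iteration over the FiniteMDP structure above computes Q ≈ Q*; the stopping tolerance tol is a hypothetical parameter, not from the talk.

```python
def value_iteration(mdp, tol=1e-6):
    """Plan with a known model: iterate the Bellman optimality backup until convergence."""
    Q = [[0.0] * mdp.n_actions for _ in range(mdp.n_states)]
    while True:
        delta = 0.0
        for s in range(mdp.n_states):
            for a in range(mdp.n_actions):
                backup = mdp.R[s][a] + mdp.gamma * sum(
                    p * max(Q[s2]) for s2, p in enumerate(mdp.T[s][a]))
                delta = max(delta, abs(backup - Q[s][a]))
                Q[s][a] = backup
        if delta < tol:
            return Q   # act greedily: pi(s) = argmax_a Q[s][a]
```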
Exploration/Exploitation Dilemma
• Similar to
  – active learning (like selective sampling)
  – bandit problems (ad ranking)
• But different/harder
  – Many heuristics may fail
[Diagram: the "dual control" loop — "exploitation" (take optimal actions, reward maximization) vs. "exploration" (try suboptimal actions, knowledge acquisition); both rely on estimates of $T$ and $R$.]
Combination Lock
[Figure: a 100-state "combination lock" MDP (states 1, 2, 3, …, 98, 99, 100; rewards of 0, 0.001, and 1000), with a plot of total rewards vs. time comparing the optimal policy, active (efficient) exploration, and poor (insufficient) exploration.]
PAC-MDP RL
• RL algorithm viewed as a non-stationary policy: $A_t : s_1 a_1 r_1 \cdots r_{t-1} s_t \mapsto a_t$
• Sample complexity [Kakade 03] (given $\epsilon > 0$): $\left| \{ t = 1, 2, 3, \ldots \mid Q^{A_t}(s_t, a_t) < Q^{*}(s_t, a_t) - \epsilon \} \right|$
• $A$ is PAC-MDP (Probably Approximately Correct in MDPs) [Strehl, Li, Wiewiora, Langford & Littman 06] if, for any $\epsilon \in (0,1)$ and $\delta \in (0,1)$:
  – with probability at least $1 - \delta$,
  – the sample complexity is $\mathrm{poly}\!\left( \frac{1}{\epsilon}, \frac{1}{\delta}, |M|, \frac{1}{1-\gamma} \right)$
• In words: we want the algorithm to act near-optimally except in a small number of steps
Why PAC-MDP?
• Sample complexity
  – number of steps where learning/exploration happens
  – related to “learning speed” or “exploration efficiency”
• Roles of parameters
  – $\epsilon$: allow small sub-optimality
  – $\delta$: allow failure due to unlucky data
  – $|M|$: measures problem complexity
  – $1/(1-\gamma)$: larger $\gamma$ makes the problem harder
• Generality
  – No assumption on ergodicity
  – No assumption on mixing
  – No need for reset or generative model
Rmax [Brafman & Tennenholtz 02]
• Rmax is for finite-state, finite-action MDPs
• Learns T and R by counting/averaging
• In $s_t$, takes the optimal action in the "known-state" MDP $M_{\mathrm{known}} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – Known state-actions: use the estimates $\hat{T}(s' \mid s, a)$ and $\hat{R}(s, a)$
  – Unknown state-actions: assign the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^{*}(s, a)$
• "Optimism in the face of uncertainty":
  – Either: explore the "unknown" region
  – Or: exploit the "known" region
• Thm: Rmax is PAC-MDP [Kakade 03]
[Diagram: the $S \times A$ space partitioned into known and unknown state-actions.]
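A compressed sketch of the Rmax bookkeeping (counting visits until a state-action becomes "known"); the knownness threshold m and the class layout are illustrative, and planning in the optimistic known-state MDP is only indicated in a comment.

```python
from collections import defaultdict

class RmaxModel:
    """Sketch of Rmax model learning: count/average until (s, a) is 'known'."""
    def __init__(self, n_states, n_actions, gamma, m):
        self.nS, self.nA, self.gamma, self.m = n_states, n_actions, gamma, m
        self.counts = defaultdict(int)         # visits to (s, a)
        self.next_counts = defaultdict(int)    # visits to (s, a, s')
        self.reward_sums = defaultdict(float)  # total reward observed at (s, a)

    def known(self, s, a):
        return self.counts[(s, a)] >= self.m

    def estimate(self, s, a):
        """Empirical T-hat(.|s, a) and R-hat(s, a) for a known pair."""
        n = self.counts[(s, a)]
        T_hat = [self.next_counts[(s, a, s2)] / n for s2 in range(self.nS)]
        return T_hat, self.reward_sums[(s, a)] / n

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r

# To act, Rmax plans (e.g., by value iteration) in the MDP that uses estimate() on
# known pairs and gives unknown pairs the optimistic value 1/(1 - gamma), then
# follows the greedy policy.
```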
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
KWIK Notation
• KWIK: Knows What It Knows [Li & Littman & Walsh 08]
• A self-aware, supervised-learning model
  – Input set: $X$
  – Output set: $Y$
  – Observation set: $Z$
  – Hypothesis class: $H \subseteq (X \to Y)$
  – Target function: $h^* \in H$ (the "realizable assumption")
  – Special symbol: ? ("I don't know")
KWIK Definition
• Given: $\epsilon$, $\delta$, $H$
• Protocol:
  – Env: picks $h^* \in H$ secretly & adversarially
  – Repeat: Env picks an input $x$ adversarially
  – Learner outputs either a prediction $\hat{y}$ ("I know") or ? ("I don't know")
  – After a ?, the learner observes $y = h^*(x)$ [deterministic case] or a measurement $z$ with $\mathbb{E}[z] = h^*(x)$ [stochastic case]
• Learning succeeds (with probability $\ge 1 - \delta$) if:
  – all predictions are accurate: $|\hat{y} - h^*(x)| \le \epsilon$
  – the total number of ?'s is small: at most $\mathrm{poly}(1/\epsilon, 1/\delta, \dim(H))$
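The protocol above can be captured by a small interface; the class and method names here are illustrative rather than from the thesis, and later sketches in these notes reuse it.

```python
class KWIKLearner:
    """Interface for a KWIK learner: predict accurately or admit ignorance."""
    DONT_KNOW = None   # stands for the special symbol "?"

    def predict(self, x):
        """Return a prediction y-hat with |y-hat - h*(x)| <= eps, or DONT_KNOW."""
        raise NotImplementedError

    def observe(self, x, z):
        """Called only after a "?": z is the label y = h*(x), or a noisy measurement."""
        raise NotImplementedError
```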
Related Frameworks
• PAC: Probably Approximately Correct [Valiant 84] — i.i.d. inputs, labels up front, no mistakes allowed
• MB: Mistake Bound [Littlestone 87] — adversarial inputs, labels revealed when a prediction is wrong
• KWIK: Knows What It Knows [Li & Littman & Walsh 08] — adversarial inputs, labels on request (after ?), no mistakes allowed
[Diagram: KWIK-learnability implies MB-learnability, which implies PAC-learnability; the reverse directions can be much harder — MB may be exponentially harder than PAC if one-way functions exist [Blum 94], and KWIK may be exponentially harder than MB [Li & Littman & Walsh 08].]
Deterministic / Finite Case (X or H is finite)
Thought experiment: You own a bar frequented by n patrons…
  – One is an instigator. When he shows up, there is a fight, unless
  – another patron, the peacemaker, is also there.
  – We want to predict, for a subset of patrons, {fight or no-fight}.
Alg. 1: Memorization
• Memorize the outcome for each subgroup of patrons
• Predict ? if the subgroup is unseen before
• #? ≤ |X|
• Bar-fight: #? ≤ 2^n
Alg. 2: Enumeration
• Enumerate all consistent (instigator, peacemaker) pairs
• Say ? when they disagree
• #? ≤ |H| - 1
• Bar-fight: #? ≤ n(n-1)
Accurate predictions can be made before complete identification of h*.
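A sketch of Alg. 2 (Enumeration) for the bar-fight problem, written against the hypothetical KWIKLearner interface above; patrons are numbered 0..n-1 and an input is the set of patrons present.

```python
from itertools import permutations

class BarFightEnumeration(KWIKLearner):
    """Tracks all (instigator, peacemaker) pairs consistent with the data."""
    def __init__(self, n_patrons):
        # Version space: all ordered pairs of distinct patrons, |H| = n(n-1).
        self.version_space = set(permutations(range(n_patrons), 2))

    @staticmethod
    def outcome(hyp, present):
        instigator, peacemaker = hyp
        return instigator in present and peacemaker not in present   # True = fight

    def predict(self, present):
        votes = {self.outcome(h, present) for h in self.version_space}
        return votes.pop() if len(votes) == 1 else self.DONT_KNOW

    def observe(self, present, fight):
        # Every "?" means the hypotheses disagreed, so at least one inconsistent
        # hypothesis is removed here; hence #? <= |H| - 1.
        self.version_space = {h for h in self.version_space
                              if self.outcome(h, present) == fight}
```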
Stochastic / Finite Case: Dice-Learning
• Problem
  – Learn a multinomial distribution over N outcomes
  – Same input at all times
  – Observe outcomes, not actual probabilities
• Algorithm
  – Predict ? for the first $O\!\left(\frac{N}{\epsilon^2} \ln \frac{N}{\delta}\right)$ times
  – Use the empirical estimate afterwards
  – Correctness follows from Chernoff's bound
• Building block for many other stochastic cases
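A sketch of dice-learning against the same interface; the number of forced ?'s, m, would be set on the order of (N/ε²)·ln(N/δ) per the bound above.

```python
class DiceLearning(KWIKLearner):
    """KWIK-learns a multinomial distribution over N outcomes from samples."""
    def __init__(self, n_outcomes, m):
        self.counts = [0] * n_outcomes
        self.m = m                      # samples to request before answering

    def predict(self, x=None):
        n = sum(self.counts)
        if n < self.m:
            return self.DONT_KNOW       # still collecting samples
        return [c / n for c in self.counts]   # empirical estimate of the distribution

    def observe(self, x, outcome):
        self.counts[outcome] += 1
```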
More Examples
• Distance to an unknown point in $\mathbb{R}^n$ [Li & Littman & Walsh 08]
• Linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09]
• Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08]
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Model-based RL
• Model-based RL (in $M = \langle S, A, T, R, \gamma \rangle$)
  – First learn estimates $\hat{T}$ and $\hat{R}$
  – Then use $\hat{M} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$ to compute $\hat{Q}^{*} \approx Q^{*}$
• Simulation lemma [Kearns & Singh 02]: if $\lVert \hat{T}(\cdot \mid s, a) - T(\cdot \mid s, a) \rVert_1 \le \epsilon_T$ and $|\hat{R}(s, a) - R(s, a)| \le \epsilon_R$ for all $(s, a)$, then (roughly) $\lVert \hat{Q}^{*} - Q^{*} \rVert_\infty \le \frac{\epsilon_R}{1-\gamma} + \frac{\epsilon_T}{(1-\gamma)^2}$
• Building a model often makes more efficient use of training data in practice
KWIK-Rmax [Li et al. 09]
• Generalizes Rmax to general MDPs
• KWIK-learns T and R simultaneously
• In $s_t$, takes the optimal action in the "known-state" MDP $M_{\mathrm{known}} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – Known state-actions: use the KWIK learners' predictions $\hat{T}(s' \mid s, a)$ and $\hat{R}(s, a)$
  – Unknown state-actions (where a KWIK learner says ?): assign the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^{*}(s, a)$
• "Optimism in the face of uncertainty":
  – Either: explore the "unknown" region
  – Or: exploit the "known" region
• (Compare Rmax [Brafman & Tennenholtz 02]: restricted to finite-state, finite-action MDPs; learns T and R by counting/averaging.)
[Diagram: the $S \times A$ space partitioned into known and unknown state-actions.]
KWIK-Rmax Analysis
• Explore-or-Exploit Lemma [Li et al. 09]
  – KWIK-Rmax either follows an $\epsilon$-optimal policy, or
  – explores an unknown state
    • allowing the KWIK learners to learn T and R!
• Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP with sample complexity
  $\tilde{O}\!\left( \frac{ B_T\!\left( \epsilon (1-\gamma)^2, \delta \right) + B_R\!\left( \epsilon (1-\gamma), \delta \right) }{ \epsilon (1-\gamma)^2 } \right)$,
  where $B_T(\cdot)$ and $B_R(\cdot)$ are the KWIK bounds of the learners for T and R
KWIK-Learning Finite MDPs by Input-Partition
• $T(\cdot \mid s, a)$ is a multinomial distribution
  – There are $|S||A|$ of them
  – Each indexed by $(s, a)$
• Input-Partition: run one dice-learning sub-learner per $(s, a)$; each input $x = (s_1, a_2)$ from the environment is routed to its own sub-learner for $T(\cdot \mid s_1, a_2)$
  [Diagram: environment → input-partition → dice-learning modules for $T(\cdot \mid s_1, a_1), T(\cdot \mid s_1, a_2), \ldots, T(\cdot \mid s_n, a_m)$]
• KWIK bound: $\#? \le O\!\left( \frac{S^2 A}{\epsilon^2} \ln \frac{S A}{\delta} \right)$ [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
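A sketch of the input-partition meta-learner, routing each (s, a) to its own sub-learner; make_learner is a hypothetical factory argument (e.g., one DiceLearning instance per state-action pair).

```python
class InputPartition(KWIKLearner):
    """Runs one independent sub-learner per partition cell (here, per (s, a))."""
    def __init__(self, make_learner):
        self.make_learner = make_learner   # e.g. lambda: DiceLearning(n_states, m)
        self.sub = {}                      # cell (s, a) -> its sub-learner

    def predict(self, x):                  # x = (s, a)
        return self.sub.setdefault(x, self.make_learner()).predict(x)

    def observe(self, x, z):               # z = observed next state
        self.sub.setdefault(x, self.make_learner()).observe(x, z)
```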
Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
[Figure: network topologies from [Guestrin & Koller & Parr & Venkataraman 03] — Bidirectional Ring, Star, Ring and Star, Ring of Rings, 3 Legs.]
Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
  – State is a vector of factors: $S = (S_1, S_2, \ldots, S_n)$
  – Transitions factor: $T(s' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid \mathrm{parents}(s_i'), a)$
  – Assuming #parents is bounded by a constant $D$
[Figure: a two-slice DBN with factors $S_1, S_2, S_3, \ldots, S_n$ at time $t$, $S_1', S_2', S_3', \ldots, S_n'$ at time $t+1$, and action $a$.]
• Challenges:
  – How to estimate $T_i(s_i' \mid \mathrm{parents}(s_i'), a)$?
  – How to discover the parents of each $s_i'$?
  – How to combine the learners $L(s_i')$ and $L(s_j')$?
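To illustrate the factored transition above, a small sketch that samples s' one factor at a time from per-factor conditional probability tables; the data layout (CPTs keyed by parent values and action) is hypothetical.

```python
import random

def sample_factored_transition(s, a, parents, cpts):
    """s: tuple of factor values; parents[i]: indices of the parents of s_i';
    cpts[i][(parent_values, a)]: multinomial over the values of factor i."""
    s_next = []
    for i in range(len(s)):
        parent_values = tuple(s[j] for j in parents[i])
        probs = cpts[i][(parent_values, a)]
        s_next.append(random.choices(range(len(probs)), weights=probs)[0])
    return tuple(s_next)
```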
KWIK-Learning DBNs with Unknown Structure
[Diagram: learning a DBN decomposes into KWIK modules — Noisy-Union (discovery of the parents of each $s_i'$), Cross-Product (combining the CPTs for $T(s_i' \mid \mathrm{parents}(s_i'), a)$), Input-Partition (entries in each CPT), and Dice-Learning (each entry's multinomial).]
From [Kearns & Koller 99]: “This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial.”
First solved by [Strehl & Diuk & Littman 07]; see also [Li & Littman & Walsh 08] and [Diuk & Li & Leffler 09].
Experiment: “System Administrator”
• Ring network, 8 machines, 9 actions
• Algorithms compared: Met-Rmax [Diuk & Li & Leffler 09], SLF-Rmax [Strehl & Diuk & Littman 07], Factored Rmax [Guestrin & Patrascu & Schuurmans 02]
MDPs with Gaussian Dynamics
• Examples: robot navigation, transportation planning
• The state offset is a multivariate normal distribution: under $T(\cdot \mid s, a)$, $s' = s + \mathrm{offset}$ with $\mathrm{offset} \sim N\!\left( \mu_{\mathrm{type}(s), a}, \Sigma_{\mathrm{type}(s), a} \right)$
• CORL [Brunskill & Leffler & Li & Littman & Roy 08]
• RAM-Rmax [Leffler & Littman & Edmunds 07]
(video by Leffler)
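A sketch of the Gaussian-offset dynamics: the next state is the current state plus a normal offset whose mean and covariance depend on the state's type and the action; the parameter names (mu, sigma, state_type) are illustrative.

```python
import numpy as np

def gaussian_offset_step(s, a, state_type, mu, sigma, rng=None):
    """s' = s + offset, offset ~ N(mu[(type(s), a)], sigma[(type(s), a)])."""
    rng = rng or np.random.default_rng()
    t = state_type(s)
    offset = rng.multivariate_normal(mu[(t, a)], sigma[(t, a)])
    return np.asarray(s, dtype=float) + offset
```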
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Model-free RL
• Estimate $Q \approx Q^{*}$ directly
  – Implying a near-optimal greedy policy $\pi(s) = \arg\max_a Q(s, a)$
  – No need to estimate T or R
• Benefits
  – Tractable computation complexity
  – Tractable space complexity
• Drawbacks
  – Seems to make inefficient use of data
  – Are there PAC-MDP model-free algorithms?
PAC-MDP Model-free RL
• Bellman equation: $Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q^{*}(s', a')$
• Bellman error: $E(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q(s', a') - Q(s, a)$
• The Bellman error $E(s, a)$ can be KWIK-learned
• Maintain optimistic Q-functions, initialized to $\frac{1}{1-\gamma}$
• At the current $(s, a)$:
  – small $E(s, a)$ $\Rightarrow$ near-optimal there $\Rightarrow$ exploit
  – otherwise $\Rightarrow$ explore
[Diagram: the $S \times A$ space with a known set $K_t \subseteq S \times A$ of state-actions with small Bellman error.]
Delayed Q-learning
• Delayed Q-learning (for finite MDPs): the first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06]
• Similar to Q-learning [Watkins 89]
  – Minimal computation complexity
  – Minimal space complexity
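A simplified sketch of the Delayed Q-learning update rule (optimistic initialization, batches of m targets per state-action, update only on a clear improvement); the full algorithm's LEARN-flag bookkeeping is omitted, and the parameters m and eps1 are placeholders for the values the analysis prescribes.

```python
from collections import defaultdict

class DelayedQLearning:
    """Simplified Delayed Q-learning sketch (LEARN flags omitted)."""
    def __init__(self, n_states, n_actions, gamma, m, eps1):
        q0 = 1.0 / (1.0 - gamma)                       # optimistic initial value
        self.Q = [[q0] * n_actions for _ in range(n_states)]
        self.gamma, self.m, self.eps1 = gamma, m, eps1
        self.targets = defaultdict(list)               # (s, a) -> buffered targets

    def act(self, s):
        return max(range(len(self.Q[s])), key=lambda a: self.Q[s][a])

    def observe(self, s, a, r, s_next):
        self.targets[(s, a)].append(r + self.gamma * max(self.Q[s_next]))
        if len(self.targets[(s, a)]) == self.m:        # attempted (delayed) update
            avg = sum(self.targets[(s, a)]) / self.m
            if self.Q[s][a] - avg >= 2 * self.eps1:    # update only on clear improvement
                self.Q[s][a] = avg + self.eps1
            self.targets[(s, a)].clear()
```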
Comparison
Improved Lower Bound for Finite MDPs
• Lower bound for the $N = 1$ case (bandits) [Mannor & Tsitsiklis 04]: $\Omega\!\left( \frac{A}{\epsilon^2} \log \frac{1}{\delta} \right)$
• Theorem: a new lower bound for finite MDPs: $\Omega\!\left( \frac{S A}{\epsilon^2} \log \frac{S}{\delta} \right)$
• Delayed Q-learning's upper bound has a matching $\tilde{O}(S A)$ dependence on the number of state-actions
KWIK with Linear Function Approximation
• Linear FA: $Q(s, a) = \sum_{i=1}^{k} w_i \phi_i(s, a) = w^{\top} \phi(s, a)$
• LSPI-Rmax [Li & Littman & Mansley 09]
  – LSPI [Lagoudakis & Parr 03] with online exploration
  – $(s, a)$ is unknown if it is under-represented in the training set
  – Includes Rmax as a special case
• REKWIRE [Li & Littman 08]
  – For finite-horizon MDPs
  – Learns Q in a bottom-up manner
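The linear approximation itself is a one-liner; a minimal sketch with a hypothetical feature function phi(s, a) returning a list of k features.

```python
def q_linear(w, phi, s, a):
    """Linear value-function approximation: Q(s, a) = sum_i w_i * phi_i(s, a)."""
    return sum(w_i * f_i for w_i, f_i in zip(w, phi(s, a)))
```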
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Open Problems
• Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK
  – Hypothesis class H may not include h*
  – “Unrealizable” KWIK [Li & Littman 08]
• Prior information in RL
  – Bayesian prior [Asmuth & Li & Littman & Nouri & Wingate 09]
  – Heuristic/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09]
• Approximate RL with KWIK
  – Least-squares policy iteration [Li & Littman & Mansley 09]
  – Fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]
  – Linear function approximation [Li & Littman 08]
Conclusions: A Unification
[Diagram: the KWIK framework [Li & Littman & Walsh 08] at the center, connected to the provably efficient RL algorithms it unifies.
Model-based: Finite MDPs [Kearns & Singh 02] [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]; Linear MDPs [Strehl & Littman 08]; RAM-MDPs [Leffler & Littman & Edmunds 07]; Gaussian-Offset MDPs [Brunskill & Leffler & Li & Littman & Roy 08]; Factored MDPs [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]; Delayed-Observation MDPs [Walsh & Nouri & Li & Littman 07].
Model-free: Finite MDPs [Strehl & Li & Wiewiora & Langford & Littman 06] (with a matching lower bound); KWIK-based value-function approximation [Li & Littman 08] [Li & Mansley & Littman 09].]

The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.
References
1. Li, Littman, & Walsh: “Knows what it knows: A framework for self-aware learning”. In ICML 2008.
2. Diuk, Li, & Leffler: “The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning”. In ICML 2009.
3. Brunskill, Leffler, Li, Littman, & Roy: “CORL: A continuous-state offset-dynamics reinforcement learner”. In UAI 2008.
4. Walsh, Nouri, Li, & Littman: “Planning and learning in environments with delayed feedback”. In ECML 2007.
5. Strehl, Li, & Littman: “Incremental model-based learners with formal learning-time guarantees”. In UAI 2006.
6. Li, Littman, & Mansley: “Online exploration in least-squares policy iteration”. In AAMAS 2009.
7. Li & Littman: “Efficient value-function approximation via online linear regression”. In AI&Math 2008.
8. Strehl, Li, Wiewiora, Langford, & Littman: “PAC model-free reinforcement learning”. In ICML 2006.