A Unifying Framework for Computational Reinforcement Learning Theory
Lihong Li
Rutgers Laboratory for Real-Life Reinforcement Learning (RL3)
Department of Computer Science, Rutgers University
PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy
Joint work with Michael Littman, Alex Strehl, Tom Walsh, …
$ponsored $earch
Are these better alternatives? Need to EXPLORE!
Thesis
The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way
for creating and analyzing RL algorithms with provably efficient exploration.
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Reinforcement Learning Example

AT&T Dialer [Li & Williams & Balakrishnan 09]
[Figure: a spoken-dialog system where the user wants to call someone at AT&T. The user's utterance ("May I speak to John Smith?") is mapped by speech recognition, NLP, belief tracking, etc. into a 100K-dimensional dialog state (features). A dialog manager, optimized by RL, maps states to actions such as Confirm("John Smith"); language generation and text-to-speech produce responses to the user ("So you want to call John Smith, is that right?"). Reward: -1 per response, +20 if the conversation succeeds, -20 if it fails. Dialog design objective: succeed in the conversation with the fewest responses.]
RL Summary
"Define reward and let the agent chase it!"
(European Workshop on Reinforcement Learning 2008)
Markov Decision Process
• Environment is often modeled as an MDP: $M = \langle S, A, T, R, \gamma \rangle$
  – $S$: set of states
  – $A$: set of actions
  – $T$: transition probabilities
  – $R$: reward function
  – $\gamma$: discount factor in $(0,1)$
• Dynamics over time $s_1, s_2, \ldots, s_t, s_{t+1}, \ldots$: at step $t$, the agent in state $s_t$ takes action $a_t$, receives reward $r_t = R(s_t, a_t)$, and moves to $s_{t+1} \sim T(\cdot \mid s_t, a_t)$
• Regularity assumption: $0 \le R(s, a) \le 1$
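To make the notation concrete, here is a minimal sketch (not from the thesis) of a finite MDP as a plain data structure; the class name FiniteMDP and its fields are illustrative, with T stored as per-(s, a) multinomials and R as a table.

```python
import random

class FiniteMDP:
    """Minimal finite MDP M = <S, A, T, R, gamma> with tabular T and R."""
    def __init__(self, n_states, n_actions, T, R, gamma):
        self.n_states = n_states      # S = {0, ..., n_states - 1}
        self.n_actions = n_actions    # A = {0, ..., n_actions - 1}
        self.T = T                    # T[s][a] = list of next-state probabilities
        self.R = R                    # R[s][a] in [0, 1]
        self.gamma = gamma            # discount factor in (0, 1)

    def step(self, s, a):
        """Sample r_t = R(s_t, a_t) and s_{t+1} ~ T(.|s_t, a_t)."""
        r = self.R[s][a]
        s_next = random.choices(range(self.n_states), weights=self.T[s][a])[0]
        return r, s_next
```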
Policies and Value Functions
• Policy: $\pi : S \to A$
• Value function: $Q^{\pi}(s, a) = \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \right]$
• Optimal value function: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
• Optimal policy: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$
• Solving an MDP: find $Q \approx Q^{*}$, then act greedily: $\pi(s) = \arg\max_{a} Q(s, a) \approx \pi^{*}(s)$
Solving an MDP
• Planning (when $T$ and $R$ are known)
  – Dynamic programming, linear programming, …
  – Relatively easy to analyze
• Learning (when $T$ or $R$ are unknown)
  – Q-learning [Watkins 89], …
  – Fundamentally harder
  – Exploration/exploitation dilemma
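For the planning case (T and R known), the following sketch of value iteration over the FiniteMDP structure above computes Q ≈ Q*; the stopping tolerance tol is a hypothetical parameter, not from the talk.

```python
def value_iteration(mdp, tol=1e-6):
    """Plan with a known model: iterate the Bellman optimality backup until convergence."""
    Q = [[0.0] * mdp.n_actions for _ in range(mdp.n_states)]
    while True:
        delta = 0.0
        for s in range(mdp.n_states):
            for a in range(mdp.n_actions):
                backup = mdp.R[s][a] + mdp.gamma * sum(
                    p * max(Q[s2]) for s2, p in enumerate(mdp.T[s][a]))
                delta = max(delta, abs(backup - Q[s][a]))
                Q[s][a] = backup
        if delta < tol:
            return Q   # act greedily: pi(s) = argmax_a Q[s][a]
```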
Exploration/Exploitation Dilemma
• Similar to
  – active learning (like selective sampling)
  – bandit problems (ad ranking)
• But different/harder
  – Many heuristics may fail
[Diagram: the "dual control" loop — "exploitation" (take optimal actions, reward maximization) vs. "exploration" (try suboptimal actions, knowledge acquisition); both rely on estimates of $T$ and $R$.]
Combination Lock
[Figure: a 100-state "combination lock" MDP (states 1, 2, 3, …, 98, 99, 100; rewards of 0, 0.001, and 1000), with a plot of total rewards vs. time comparing the optimal policy, active (efficient) exploration, and poor (insufficient) exploration.]
PAC-MDP RL
• RL algorithm viewed as a non-stationary policy: $A_t : s_1 a_1 r_1 \cdots r_{t-1} s_t \mapsto a_t$
• Sample complexity [Kakade 03] (given $\epsilon > 0$): $\left| \{ t = 1, 2, 3, \ldots \mid Q^{A_t}(s_t, a_t) < Q^{*}(s_t, a_t) - \epsilon \} \right|$
• $A$ is PAC-MDP (Probably Approximately Correct in MDPs) [Strehl, Li, Wiewiora, Langford & Littman 06] if, for any $\epsilon \in (0,1)$ and $\delta \in (0,1)$:
  – with probability at least $1 - \delta$,
  – the sample complexity is $\mathrm{poly}\!\left( \frac{1}{\epsilon}, \frac{1}{\delta}, |M|, \frac{1}{1-\gamma} \right)$
• In words: we want the algorithm to act near-optimally except in a small number of steps
Why PAC-MDP?
• Sample complexity
  – number of steps where learning/exploration happens
  – related to “learning speed” or “exploration efficiency”
• Roles of parameters
  – $\epsilon$: allow small sub-optimality
  – $\delta$: allow failure due to unlucky data
  – $|M|$: measures problem complexity
  – $1/(1-\gamma)$: larger $\gamma$ makes the problem harder
• Generality
  – No assumption on ergodicity
  – No assumption on mixing
  – No need for reset or generative model
Rmax [Brafman & Tennenholtz 02]
• Rmax is for finite-state, finite-action MDPs
• Learns T and R by counting/averaging
• In $s_t$, takes the optimal action in the "known-state" MDP $M_{\mathrm{known}} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – Known state-actions: use the estimates $\hat{T}(s' \mid s, a)$ and $\hat{R}(s, a)$
  – Unknown state-actions: assign the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^{*}(s, a)$
• "Optimism in the face of uncertainty":
  – Either: explore the "unknown" region
  – Or: exploit the "known" region
• Thm: Rmax is PAC-MDP [Kakade 03]
[Diagram: the $S \times A$ space partitioned into known and unknown state-actions.]
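A compressed sketch of the Rmax bookkeeping (counting visits until a state-action becomes "known"); the knownness threshold m and the class layout are illustrative, and planning in the optimistic known-state MDP is only indicated in a comment.

```python
from collections import defaultdict

class RmaxModel:
    """Sketch of Rmax model learning: count/average until (s, a) is 'known'."""
    def __init__(self, n_states, n_actions, gamma, m):
        self.nS, self.nA, self.gamma, self.m = n_states, n_actions, gamma, m
        self.counts = defaultdict(int)         # visits to (s, a)
        self.next_counts = defaultdict(int)    # visits to (s, a, s')
        self.reward_sums = defaultdict(float)  # total reward observed at (s, a)

    def known(self, s, a):
        return self.counts[(s, a)] >= self.m

    def estimate(self, s, a):
        """Empirical T-hat(.|s, a) and R-hat(s, a) for a known pair."""
        n = self.counts[(s, a)]
        T_hat = [self.next_counts[(s, a, s2)] / n for s2 in range(self.nS)]
        return T_hat, self.reward_sums[(s, a)] / n

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r

# To act, Rmax plans (e.g., by value iteration) in the MDP that uses estimate() on
# known pairs and gives unknown pairs the optimistic value 1/(1 - gamma), then
# follows the greedy policy.
```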
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
KWIK Notation
• KWIK: Knows What It Knows [Li & Littman & Walsh 08]
• A self-aware, supervised-learning model
  – Input set: $X$
  – Output set: $Y$
  – Observation set: $Z$
  – Hypothesis class: $H \subseteq (X \to Y)$
  – Target function: $h^* \in H$ (the "realizable assumption")
  – Special symbol: ? ("I don't know")
KWIK Definition
• Given: $\epsilon$, $\delta$, $H$
• Protocol:
  – Env: picks $h^* \in H$ secretly & adversarially
  – Repeat: Env picks an input $x$ adversarially
  – Learner outputs either a prediction $\hat{y}$ ("I know") or ? ("I don't know")
  – After a ?, the learner observes $y = h^*(x)$ [deterministic case] or a measurement $z$ with $\mathbb{E}[z] = h^*(x)$ [stochastic case]
• Learning succeeds (with probability $\ge 1 - \delta$) if:
  – all predictions are accurate: $|\hat{y} - h^*(x)| \le \epsilon$
  – the total number of ?'s is small: at most $\mathrm{poly}(1/\epsilon, 1/\delta, \dim(H))$
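The protocol above can be captured by a small interface; the class and method names here are illustrative rather than from the thesis, and later sketches in these notes reuse it.

```python
class KWIKLearner:
    """Interface for a KWIK learner: predict accurately or admit ignorance."""
    DONT_KNOW = None   # stands for the special symbol "?"

    def predict(self, x):
        """Return a prediction y-hat with |y-hat - h*(x)| <= eps, or DONT_KNOW."""
        raise NotImplementedError

    def observe(self, x, z):
        """Called only after a "?": z is the label y = h*(x), or a noisy measurement."""
        raise NotImplementedError
```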
Related Frameworks
• PAC: Probably Approximately Correct [Valiant 84] — i.i.d. inputs, labels up front, no mistakes allowed
• MB: Mistake Bound [Littlestone 87] — adversarial inputs, labels revealed when a prediction is wrong
• KWIK: Knows What It Knows [Li & Littman & Walsh 08] — adversarial inputs, labels on request (after ?), no mistakes allowed
[Diagram: KWIK-learnability implies MB-learnability, which implies PAC-learnability; the reverse directions can be much harder — MB may be exponentially harder than PAC if one-way functions exist [Blum 94], and KWIK may be exponentially harder than MB [Li & Littman & Walsh 08].]
Deterministic / Finite Case (X or H is finite)
Thought experiment: You own a bar frequented by n patrons…
  – One is an instigator. When he shows up, there is a fight, unless
  – another patron, the peacemaker, is also there.
  – We want to predict, for a subset of patrons, {fight or no-fight}.
Alg. 1: Memorization
• Memorize the outcome for each subgroup of patrons
• Predict ? if the subgroup is unseen before
• #? ≤ |X|
• Bar-fight: #? ≤ 2^n
Alg. 2: Enumeration
• Enumerate all consistent (instigator, peacemaker) pairs
• Say ? when they disagree
• #? ≤ |H| - 1
• Bar-fight: #? ≤ n(n-1)
Accurate predictions can be made before complete identification of h*.
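A sketch of Alg. 2 (Enumeration) for the bar-fight problem, written against the hypothetical KWIKLearner interface above; patrons are numbered 0..n-1 and an input is the set of patrons present.

```python
from itertools import permutations

class BarFightEnumeration(KWIKLearner):
    """Tracks all (instigator, peacemaker) pairs consistent with the data."""
    def __init__(self, n_patrons):
        # Version space: all ordered pairs of distinct patrons, |H| = n(n-1).
        self.version_space = set(permutations(range(n_patrons), 2))

    @staticmethod
    def outcome(hyp, present):
        instigator, peacemaker = hyp
        return instigator in present and peacemaker not in present   # True = fight

    def predict(self, present):
        votes = {self.outcome(h, present) for h in self.version_space}
        return votes.pop() if len(votes) == 1 else self.DONT_KNOW

    def observe(self, present, fight):
        # Every "?" means the hypotheses disagreed, so at least one inconsistent
        # hypothesis is removed here; hence #? <= |H| - 1.
        self.version_space = {h for h in self.version_space
                              if self.outcome(h, present) == fight}
```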
Stochastic / Finite Case: Dice-Learning
• Problem
  – Learn a multinomial distribution over N outcomes
  – Same input at all times
  – Observe outcomes, not actual probabilities
• Algorithm
  – Predict ? for the first $O\!\left(\frac{N}{\epsilon^2} \ln \frac{N}{\delta}\right)$ times
  – Use the empirical estimate afterwards
  – Correctness follows from Chernoff's bound
• Building block for many other stochastic cases
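A sketch of dice-learning against the same interface; the number of forced ?'s, m, would be set on the order of (N/ε²)·ln(N/δ) per the bound above.

```python
class DiceLearning(KWIKLearner):
    """KWIK-learns a multinomial distribution over N outcomes from samples."""
    def __init__(self, n_outcomes, m):
        self.counts = [0] * n_outcomes
        self.m = m                      # samples to request before answering

    def predict(self, x=None):
        n = sum(self.counts)
        if n < self.m:
            return self.DONT_KNOW       # still collecting samples
        return [c / n for c in self.counts]   # empirical estimate of the distribution

    def observe(self, x, outcome):
        self.counts[outcome] += 1
```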
More Examples
• Distance to an unknown point in $\mathbb{R}^n$ [Li & Littman & Walsh 08]
• Linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09]
• Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08]
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Model-based RL
• Model-based RL (in $M = \langle S, A, T, R, \gamma \rangle$)
  – First learn estimates $\hat{T}$ and $\hat{R}$
  – Then use $\hat{M} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$ to compute $\hat{Q}^{*} \approx Q^{*}$
• Simulation lemma [Kearns & Singh 02]: if $\lVert \hat{T}(\cdot \mid s, a) - T(\cdot \mid s, a) \rVert_1 \le \epsilon_T$ and $|\hat{R}(s, a) - R(s, a)| \le \epsilon_R$ for all $(s, a)$, then (roughly) $\lVert \hat{Q}^{*} - Q^{*} \rVert_\infty \le \frac{\epsilon_R}{1-\gamma} + \frac{\epsilon_T}{(1-\gamma)^2}$
• Building a model often makes more efficient use of training data in practice
KWIK-Rmax [Li et al. 09]
• Generalizes Rmax to general MDPs
• KWIK-learns T and R simultaneously
• In $s_t$, takes the optimal action in the "known-state" MDP $M_{\mathrm{known}} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – Known state-actions: use the KWIK learners' predictions $\hat{T}(s' \mid s, a)$ and $\hat{R}(s, a)$
  – Unknown state-actions (where a KWIK learner says ?): assign the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^{*}(s, a)$
• "Optimism in the face of uncertainty":
  – Either: explore the "unknown" region
  – Or: exploit the "known" region
• (Compare Rmax [Brafman & Tennenholtz 02]: restricted to finite-state, finite-action MDPs; learns T and R by counting/averaging.)
[Diagram: the $S \times A$ space partitioned into known and unknown state-actions.]
KWIK-Rmax Analysis
• Explore-or-Exploit Lemma [Li et al. 09]
  – KWIK-Rmax either follows an $\epsilon$-optimal policy, or
  – explores an unknown state
    • allowing the KWIK learners to learn T and R!
• Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP with sample complexity
  $\tilde{O}\!\left( \frac{ B_T\!\left( \epsilon (1-\gamma)^2, \delta \right) + B_R\!\left( \epsilon (1-\gamma), \delta \right) }{ \epsilon (1-\gamma)^2 } \right)$,
  where $B_T(\cdot)$ and $B_R(\cdot)$ are the KWIK bounds of the learners for T and R
KWIK-Learning Finite MDPs by Input-Partition
• $T(\cdot \mid s, a)$ is a multinomial distribution
  – There are $|S||A|$ of them
  – Each indexed by $(s, a)$
• Input-Partition: run one dice-learning sub-learner per $(s, a)$; each input $x = (s_1, a_2)$ from the environment is routed to its own sub-learner for $T(\cdot \mid s_1, a_2)$
  [Diagram: environment → input-partition → dice-learning modules for $T(\cdot \mid s_1, a_1), T(\cdot \mid s_1, a_2), \ldots, T(\cdot \mid s_n, a_m)$]
• KWIK bound: $\#? \le O\!\left( \frac{S^2 A}{\epsilon^2} \ln \frac{S A}{\delta} \right)$ [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
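A sketch of the input-partition meta-learner, routing each (s, a) to its own sub-learner; make_learner is a hypothetical factory argument (e.g., one DiceLearning instance per state-action pair).

```python
class InputPartition(KWIKLearner):
    """Runs one independent sub-learner per partition cell (here, per (s, a))."""
    def __init__(self, make_learner):
        self.make_learner = make_learner   # e.g. lambda: DiceLearning(n_states, m)
        self.sub = {}                      # cell (s, a) -> its sub-learner

    def predict(self, x):                  # x = (s, a)
        return self.sub.setdefault(x, self.make_learner()).predict(x)

    def observe(self, x, z):               # z = observed next state
        self.sub.setdefault(x, self.make_learner()).observe(x, z)
```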
Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
[Figure: network topologies from [Guestrin & Koller & Parr & Venkataraman 03] — Bidirectional Ring, Star, Ring and Star, Ring of Rings, 3 Legs.]
Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
  – State is a vector of factors: $S = (S_1, S_2, \ldots, S_n)$
  – Transitions factor: $T(s' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid \mathrm{parents}(s_i'), a)$
  – Assuming #parents is bounded by a constant $D$
[Figure: a two-slice DBN with factors $S_1, S_2, S_3, \ldots, S_n$ at time $t$, $S_1', S_2', S_3', \ldots, S_n'$ at time $t+1$, and action $a$.]
• Challenges:
  – How to estimate $T_i(s_i' \mid \mathrm{parents}(s_i'), a)$?
  – How to discover the parents of each $s_i'$?
  – How to combine the learners $L(s_i')$ and $L(s_j')$?
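To illustrate the factored transition above, a small sketch that samples s' one factor at a time from per-factor conditional probability tables; the data layout (CPTs keyed by parent values and action) is hypothetical.

```python
import random

def sample_factored_transition(s, a, parents, cpts):
    """s: tuple of factor values; parents[i]: indices of the parents of s_i';
    cpts[i][(parent_values, a)]: multinomial over the values of factor i."""
    s_next = []
    for i in range(len(s)):
        parent_values = tuple(s[j] for j in parents[i])
        probs = cpts[i][(parent_values, a)]
        s_next.append(random.choices(range(len(probs)), weights=probs)[0])
    return tuple(s_next)
```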
KWIK-Learning DBNs with Unknown Structure
[Diagram: learning a DBN decomposes into KWIK modules — Noisy-Union (discovery of the parents of each $s_i'$), Cross-Product (combining the CPTs for $T(s_i' \mid \mathrm{parents}(s_i'), a)$), Input-Partition (entries in each CPT), and Dice-Learning (each entry's multinomial).]
From [Kearns & Koller 99]: “This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial.”
First solved by [Strehl & Diuk & Littman 07]; see also [Li & Littman & Walsh 08] and [Diuk & Li & Leffler 09].
Experiment: “System Administrator”
• Ring network, 8 machines, 9 actions
• Algorithms compared: Met-Rmax [Diuk & Li & Leffler 09], SLF-Rmax [Strehl & Diuk & Littman 07], Factored Rmax [Guestrin & Patrascu & Schuurmans 02]
MDPs with Gaussian Dynamics
• Examples: robot navigation, transportation planning
• The state offset is a multivariate normal distribution: under $T(\cdot \mid s, a)$, $s' = s + \mathrm{offset}$ with $\mathrm{offset} \sim N\!\left( \mu_{\mathrm{type}(s), a}, \Sigma_{\mathrm{type}(s), a} \right)$
• CORL [Brunskill & Leffler & Li & Littman & Roy 08]
• RAM-Rmax [Leffler & Littman & Edmunds 07]
(video by Leffler)
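A sketch of the Gaussian-offset dynamics: the next state is the current state plus a normal offset whose mean and covariance depend on the state's type and the action; the parameter names (mu, sigma, state_type) are illustrative.

```python
import numpy as np

def gaussian_offset_step(s, a, state_type, mu, sigma, rng=None):
    """s' = s + offset, offset ~ N(mu[(type(s), a)], sigma[(type(s), a)])."""
    rng = rng or np.random.default_rng()
    t = state_type(s)
    offset = rng.multivariate_normal(mu[(t, a)], sigma[(t, a)])
    return np.asarray(s, dtype=float) + offset
```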
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Model-free RL
• Estimate $Q \approx Q^{*}$ directly
  – Implying a near-optimal greedy policy $\pi(s) = \arg\max_a Q(s, a)$
  – No need to estimate T or R
• Benefits
  – Tractable computation complexity
  – Tractable space complexity
• Drawbacks
  – Seems to make inefficient use of data
  – Are there PAC-MDP model-free algorithms?
PAC-MDP Model-free RL
• Bellman equation: $Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q^{*}(s', a')$
• Bellman error: $E(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q(s', a') - Q(s, a)$
• The Bellman error $E(s, a)$ can be KWIK-learned
• Maintain optimistic Q-functions, initialized to $\frac{1}{1-\gamma}$
• At the current $(s, a)$:
  – small $E(s, a)$ $\Rightarrow$ near-optimal there $\Rightarrow$ exploit
  – otherwise $\Rightarrow$ explore
[Diagram: the $S \times A$ space with a known set $K_t \subseteq S \times A$ of state-actions with small Bellman error.]
Delayed Q-learning
• Delayed Q-learning (for finite MDPs): the first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06]
• Similar to Q-learning [Watkins 89]
  – Minimal computation complexity
  – Minimal space complexity
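A simplified sketch of the Delayed Q-learning update rule (optimistic initialization, batches of m targets per state-action, update only on a clear improvement); the full algorithm's LEARN-flag bookkeeping is omitted, and the parameters m and eps1 are placeholders for the values the analysis prescribes.

```python
from collections import defaultdict

class DelayedQLearning:
    """Simplified Delayed Q-learning sketch (LEARN flags omitted)."""
    def __init__(self, n_states, n_actions, gamma, m, eps1):
        q0 = 1.0 / (1.0 - gamma)                       # optimistic initial value
        self.Q = [[q0] * n_actions for _ in range(n_states)]
        self.gamma, self.m, self.eps1 = gamma, m, eps1
        self.targets = defaultdict(list)               # (s, a) -> buffered targets

    def act(self, s):
        return max(range(len(self.Q[s])), key=lambda a: self.Q[s][a])

    def observe(self, s, a, r, s_next):
        self.targets[(s, a)].append(r + self.gamma * max(self.Q[s_next]))
        if len(self.targets[(s, a)]) == self.m:        # attempted (delayed) update
            avg = sum(self.targets[(s, a)]) / self.m
            if self.Q[s][a] - avg >= 2 * self.eps1:    # update only on clear improvement
                self.Q[s][a] = avg + self.eps1
            self.targets[(s, a)].clear()
```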
Comparison
Improved Lower Bound for Finite MDPs
• Lower bound for the $N = 1$ case (bandits) [Mannor & Tsitsiklis 04]: $\Omega\!\left( \frac{A}{\epsilon^2} \log \frac{1}{\delta} \right)$
• Theorem: a new lower bound for finite MDPs: $\Omega\!\left( \frac{S A}{\epsilon^2} \log \frac{S}{\delta} \right)$
• Delayed Q-learning's upper bound has a matching $\tilde{O}(S A)$ dependence on the number of state-actions
KWIK with Linear Function Approximation
• Linear FA: $Q(s, a) = \sum_{i=1}^{k} w_i \phi_i(s, a) = w^{\top} \phi(s, a)$
• LSPI-Rmax [Li & Littman & Mansley 09]
  – LSPI [Lagoudakis & Parr 03] with online exploration
  – $(s, a)$ is unknown if it is under-represented in the training set
  – Includes Rmax as a special case
• REKWIRE [Li & Littman 08]
  – For finite-horizon MDPs
  – Learns Q in a bottom-up manner
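The linear approximation itself is a one-liner; a minimal sketch with a hypothetical feature function phi(s, a) returning a list of k features.

```python
def q_linear(w, phi, s, a):
    """Linear value-function approximation: Q(s, a) = sum_i w_i * phi_i(s, a)."""
    return sum(w_i * f_i for w_i, f_i in zip(w, phi(s, a)))
```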
Outline
• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions
Open Problems
• Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK
  – Hypothesis class H may not include h*
  – “Unrealizable” KWIK [Li & Littman 08]
• Prior information in RL
  – Bayesian prior [Asmuth & Li & Littman & Nouri & Wingate 09]
  – Heuristic/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09]
• Approximate RL with KWIK
  – Least-squares policy iteration [Li & Littman & Mansley 09]
  – Fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]
  – Linear function approximation [Li & Littman 08]
Conclusions: A Unification
[Diagram: the KWIK framework [Li & Littman & Walsh 08] at the center, connected to the provably efficient RL algorithms it unifies.
Model-based: Finite MDPs [Kearns & Singh 02] [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]; Linear MDPs [Strehl & Littman 08]; RAM-MDPs [Leffler & Littman & Edmunds 07]; Gaussian-Offset MDPs [Brunskill & Leffler & Li & Littman & Roy 08]; Factored MDPs [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]; Delayed-Observation MDPs [Walsh & Nouri & Li & Littman 07].
Model-free: Finite MDPs [Strehl & Li & Wiewiora & Langford & Littman 06] (with a matching lower bound); KWIK-based value-function approximation [Li & Littman 08] [Li & Mansley & Littman 09].]

The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.
References
1. Li, Littman, & Walsh: “Knows what it knows: A framework for self-aware learning”. In ICML 2008.
2. Diuk, Li, & Leffler: “The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning”. In ICML 2009.
3. Brunskill, Leffler, Li, Littman, & Roy: “CORL: A continuous-state offset-dynamics reinforcement learner”. In UAI 2008.
4. Walsh, Nouri, Li, & Littman: “Planning and learning in environments with delayed feedback”. In ECML 2007.
5. Strehl, Li, & Littman: “Incremental model-based learners with formal learning-time guarantees”. In UAI 2006.
6. Li, Littman, & Mansley: “Online exploration in least-squares policy iteration”. In AAMAS 2009.
7. Li & Littman: “Efficient value-function approximation via online linear regression”. In AI&Math 2008.
8. Strehl, Li, Wiewiora, Langford, & Littman: “PAC model-free reinforcement learning”. In ICML 2006.