Max-norm Projections for Factored MDPs
Carlos Guestrin, Stanford University
Daphne Koller, Stanford University
Ronald Parr, Duke University
Motivation
MDPs: plan over atomic system states;
Policy: specifies the action at every state;
Polytime algorithms exist for finding the optimal policy;
But the number of states is exponential in the number of state variables.
Motivation: BNs meet MDPs
Real-world MDPs have: hundreds of variables; googols of states.
Can we exploit problem-specific structure? For representation; for planning.
Goal: merge BNs and MDPs for efficient computation.
Factored MDPs [Boutilier et al.]
Transition model is a dynamic Bayesian network: variables X, Y, Z at time t influence X', Y', Z' at time t+1.
(Figure: two-slice DBN over X, Y, Z with sub-reward nodes R1 and R2.)
Total reward is a sum of sub-rewards: R = R1 + R2.
Actions only change small parts of the model.
Value function: value of the policy starting at state s.
Exploiting Structure
Structured value function approach [Boutilier et al. '95]: collapse the value function using a tree; works well only when many states have the same value.
(Figure: decision-tree value function branching on X, then Z, with leaf values V = 3, 5, 9.)
Model structure may imply a structured value function.
Decomposable Value Functions
Each h_i is the status of some small part(s) of a complex system: status of a machine; inventory of a store.
Approximate the value as a linear combination of restricted-domain functions [Bellman et al. '63][Tsitsiklis & Van Roy '96][Koller & Parr '99, '00]:
    V(s) ≈ Σ_i w_i h_i(s),  i.e.  Ṽ = A w,
where A is the 2^n × k matrix whose k columns are the basis functions evaluated at every state: A[i, j] = h_j(s_i).
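The basis matrix A can be built explicitly for small n. A minimal sketch, assuming two illustrative single-variable basis functions h1 and h2 over n = 4 binary state variables (the functions and their values are invented, not from the talk):

```python
import itertools

# Hypothetical restricted-domain basis functions, each depending on a
# single state variable; the returned values are made up for illustration.
def h1(s):
    return 1.0 if s[0] else 0.0   # looks only at variable 0

def h2(s):
    return 2.0 if s[3] else 0.5   # looks only at variable 3

n = 4                              # number of binary state variables
basis = [h1, h2]                   # k = 2 basis functions
states = list(itertools.product([0, 1], repeat=n))  # all 2^n states

# A is the 2^n x k basis matrix: A[i][j] = h_j(s_i)
A = [[h(s) for h in basis] for s in states]
```

For real problems A is never materialized; the point of the factored approach is to work with the restricted-domain functions directly rather than with the exponentially long columns.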
Our Approach
Embed structure into the value function space a priori: project into a structured vector space of factored value functions; efficiently find the closest approximation to the "true" value.
    Ṽ = Σ_k w_k h_k : a linear combination of structured features.
Policy Iteration
Guess V; π = greedy(V); V = value of acting on π.
Value determination:
    V = R + γ P_π V
    (2^n × 1) = (2^n × 1) + (2^n × 2^n)(2^n × 1)
    Value = Reward + Discounted expected value.
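The value-determination step can be sketched directly for a tiny MDP by solving the linear system (I − γ P_π) V = R. The 3-state chain, its transition probabilities, and its rewards below are invented for illustration:

```python
import numpy as np

# Exact value determination for a fixed policy: V = R + gamma * P V,
# solved as (I - gamma * P) V = R. The 3-state chain is made up.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # P[s, s'] under the fixed policy
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 0.0, 0.0])

V = np.linalg.solve(np.eye(3) - gamma * P, R)
# V satisfies the Bellman equation V = R + gamma * P V
```

This exact solve costs O((2^n)^3) for a factored MDP with n variables, which is exactly the step the paper replaces with a projection.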
Approximate Policy Iteration
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
Approximate value determination:
    A w ≈ R + γ P_π A w.
Approximate Value Determination
Need a projection of the value function into the space of the basis functions (an L_d projection):
    w = argmin_w ‖ A w − (R_π + γ P_π A w) ‖_d.
Previous work uses L2 and weighted-L2 projections [Koller & Parr '99, '00].
Define the worst-case projection error over iterations:
    β_P = max_{τ = 1…t} ‖ A w_τ − (R_{π_τ} + γ P_{π_τ} A w_τ) ‖_∞.
Analysis of Approx. PI
Theorem:
    ‖ A w_t − V* ‖_∞ ≤ γ^t ‖ A w_0 − V* ‖_∞ + 2 γ β_P / (1 − γ)^2.
We should be doing projections in max-norm!
    w = argmin_w ‖ (A − γ P_π A) w − R ‖_∞.
Approximate PI: Revisited
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
Approximate value determination: A w ≈ R + γ P_π A w.
Analysis motivates projections in max-norm;
Efficient algorithm for max-norm projection.
Efficient Max-norm Projection
Outline: computing the max-norm for fixed weights; cost networks; efficient max-norm projection.
Rewrite the projection
    w = argmin_w ‖ (A − γ P_π A) w − R ‖_∞
as
    w = argmin_w ‖ H w − b ‖_∞,  with H = A − γ P_π A and b = R.
Max over Large State Spaces
For fixed weights w, compute the max-norm:
    ‖ H w − b ‖_∞ = max_s | Σ_i w_i h_i(s) − b(s) |.
However, if the basis and target are functions of only a few variables each, we can do it efficiently!
Cost networks can maximize over large state spaces efficiently when the function is factored:
    f = max_{X_1, …, X_n} Σ_i C_i,  where each factor C_i depends only on a subset of {X_1, …, X_n}.
Cost Networks
Can use variable elimination to maximize over the state space [Bertele & Brioschi '72]:
    max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
    = max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D ( f3(C,D) + f4(B,D) ) ]
    = max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ].
(Figure: cost network over variables A, B, C, D with factors f1, f2, f3, f4.)
As in Bayes nets, maximization is exponential in the size of the largest factor.
Here we need only 16, instead of 64, sum operations.
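The elimination above can be checked numerically. A minimal sketch with arbitrary illustrative factor tables over binary A, B, C, D, eliminating D first exactly as on the slide:

```python
import itertools

# Factor tables over binary variables; the numbers are invented.
f1 = {(a, b): 2 * a - b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): a * c + 1 for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): 3 * c * d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): b + d     for b in (0, 1) for d in (0, 1)}

# Eliminate D: g1(B, C) = max_D [ f3(C, D) + f4(B, D) ]
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}

# Maximize the remaining factored function over A, B, C
elim_max = max(f1[a, b] + f2[a, c] + g1[b, c]
               for a in (0, 1) for b in (0, 1) for c in (0, 1))

# Brute force over all 2^4 joint assignments must agree
brute_max = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
                for a, b, c, d in itertools.product((0, 1), repeat=4))
assert elim_max == brute_max
```

The elimination variant touches only the small factors, which is why the cost scales with the largest factor size rather than with 2^n.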
Max-norm Projection
Algorithm for finding w* ∈ argmin_w ‖ H w − b ‖_∞. Solve by linear programming [Cheney '82]:
    Variables: w_1, …, w_k, φ;
    Minimize: φ;
    Subject to: φ ≥ max_s ( Σ_{i=1..k} w_i h_i(s) − b(s) )
                and φ ≥ max_s ( b(s) − Σ_{i=1..k} w_i h_i(s) ).
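The projection LP can be written down directly for a tiny instance. A sketch using scipy.optimize.linprog; the instance, with one basis function and two states, is made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Max-norm projection as an LP: minimize phi s.t. |(Hw - b)_s| <= phi.
# Tiny invented instance: one basis function (k = 1), two states.
H = np.array([[1.0],
              [1.0]])
b = np.array([0.0, 2.0])
n_s, k = H.shape

c = np.zeros(k + 1)
c[-1] = 1.0                        # objective: minimize phi
# Encode Hw - phi <= b  and  -Hw - phi <= -b
A_ub = np.block([[ H, -np.ones((n_s, 1))],
                 [-H, -np.ones((n_s, 1))]])
b_ub = np.concatenate([b, -b])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (k + 1))
w, phi = res.x[:k], res.x[-1]      # optimum balances the two residuals
```

Note that linprog defaults to nonnegative variables, so free bounds must be passed explicitly; here the optimum is w = 1, phi = 1, splitting the residual evenly between the two states.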
Representing the Constraints
Explicit representation is exponential (|S| = 2^n):
    φ ≥ Σ_{i=1..k} w_i h_i(s) − b(s),  for every state s ∈ S.
If basis and target are factored, cost networks can represent the constraints compactly. For example,
    φ ≥ max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
is replaced, after eliminating D, by
    φ ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ],
    g1(B,C) ≥ f3(C,D) + f4(B,D)  for every assignment to B, C, D,
where the g1(B,C) are new LP variables.
Approximate Policy Iteration
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
How do we represent the policy? How do we update it efficiently?
Policy improvement.
What about the Policy?
Contextual action model: each action's DBN differs from a default model only in a few contexts.
(Figure: DBNs over X, Y, Z at time t and X', Y', Z' at time t+1 for the default model, Action 1, and Action 2.)
Factored value functions and model give a compact policy description. The policy forms a decision list:
    if context over x, y, z holds then action 1; else if context over x holds then action 2; else if … then action 1.
Theorem [Koller & Parr '00]: the greedy policy relative to a factored value function can be represented as a decision list.
Factored Policy Iteration: Summary
Guess V; π = greedy(V); V = value of acting on π.
Structure induces a decision-list policy.
Key operations are isomorphic to Bayesian network inference.
Time per iteration reduced from O((2^n)^3) to O(poly(k, n, C)):
    C = largest factor in the cost network (a function of structure);
    k = number of basis functions (k << 2^n);
    poly = complexity of the LP solver, in practice close to linear.
Network Management Problem
Computers connected in a network;
Each computer can fail with some probability;
If a computer fails, it increases the probability its neighbors will fail;
At every time step, the sys-admin must decide which computer to fix.
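The failure dynamics can be sketched for a ring topology; the base failure probability and the per-failed-neighbor penalty below are assumed illustrative values, not taken from the paper:

```python
# Minimal sketch of the sysadmin problem's dynamics on a ring.
# BASE_FAIL and NEIGHBOR_PENALTY are assumed values for illustration.
BASE_FAIL = 0.05          # failure probability with all neighbors working
NEIGHBOR_PENALTY = 0.3    # extra failure probability per failed neighbor

def fail_prob(working, i):
    """Probability that machine i is down at the next time step."""
    n = len(working)
    if not working[i]:
        return 1.0  # a failed machine stays down until the sysadmin fixes it
    left = working[(i - 1) % n]
    right = working[(i + 1) % n]
    failed_neighbors = (not left) + (not right)
    return min(1.0, BASE_FAIL + NEIGHBOR_PENALTY * failed_neighbors)

all_up = [True, True, True, True]
one_down = [True, False, True, True]
```

Each machine's next-step status depends only on itself and its ring neighbors, which is exactly the restricted-domain structure the factored transition model and basis functions exploit.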
(Figure: example network topologies: bidirectional ring; ring and star; star; 3 legs; ring of rings; some with a central server.)
Comparing Projections in L2 and L∞
Max-norm projection is also much more efficient: a single cost network rather than many BN inferences; use of a very efficient LP package (CPLEX).
(Chart: relative error vs. number of variables, 3 to 10; series: L2 single basis, L∞ single basis, L∞ pair basis, L2 pair basis.)
Results on Larger Problems: Running Time
(Chart: total time in minutes vs. number of states, 10^0 to 10^14, for the Ring, 3 Legs, and Star topologies.)
Runs in time O(n^3), not O((2^n)^3).
Results on Larger Problems: Error Bounds
(Chart: Bellman error / Rmax vs. number of states, 10^0 to 10^14, for the Ring, 3 Legs, and Star topologies.)
Error remains bounded.
Conclusions
Max-norm projection directly minimizes the error bounds;
Closed-form projection operation provides an exponential complexity reduction;
Exploit structure to reduce computation costs! Solve very large MDPs efficiently.
Future Work
POMDPs (IJCAI '01 workshop paper);
Additional structure: factored actions; relational representations; context-specific independence (CSI);
Multi-agent systems;
Linear programming solution for MDPs.