Max-norm Projections for Factored MDPs
Carlos Guestrin, Stanford University
Daphne Koller, Stanford University
Ronald Parr, Duke University
Motivation
MDPs: plan over atomic system states;
Policy: specifies the action at every state;
Polytime algorithms exist for finding the optimal policy;
But the number of states is exponential in the number of state variables.
Motivation: BNs meet MDPs
Real-world MDPs have: hundreds of variables; googols of states.
Can we exploit problem-specific structure? For representation; for planning.
Goal: merge BNs and MDPs for efficient computation.
Factored MDPs [Boutilier et al.]
Transition model is a dynamic Bayesian network: variables X, Y, Z at time t influence X', Y', Z' at time t+1.
(Figure: two-slice DBN over X, Y, Z with sub-reward nodes R1 and R2.)
Total reward is a sum of sub-rewards: R = R1 + R2.
Actions only change small parts of the model.
Value function: value of the policy starting at state s.
Exploiting Structure
Structured value function approach [Boutilier et al. '95]: collapse the value function using a tree; works well only when many states have the same value.
(Figure: decision-tree value function branching on X, then Z, with leaf values V = 3, 5, 9.)
Model structure may imply a structured value function.
Decomposable Value Functions
Each h_i is the status of some small part(s) of a complex system: status of a machine; inventory of a store.
Approximate the value as a linear combination of restricted-domain functions [Bellman et al. '63][Tsitsiklis & Van Roy '96][Koller & Parr '99, '00]:
    V(s) ≈ Σ_i w_i h_i(s),  i.e.  Ṽ = A w,
where A is the 2^n × k matrix whose k columns are the basis functions evaluated at every state: A[i, j] = h_j(s_i).
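The basis matrix A can be built explicitly for small n. A minimal sketch, assuming two illustrative single-variable basis functions h1 and h2 over n = 4 binary state variables (the functions and their values are invented, not from the talk):

```python
import itertools

# Hypothetical restricted-domain basis functions, each depending on a
# single state variable; the returned values are made up for illustration.
def h1(s):
    return 1.0 if s[0] else 0.0   # looks only at variable 0

def h2(s):
    return 2.0 if s[3] else 0.5   # looks only at variable 3

n = 4                              # number of binary state variables
basis = [h1, h2]                   # k = 2 basis functions
states = list(itertools.product([0, 1], repeat=n))  # all 2^n states

# A is the 2^n x k basis matrix: A[i][j] = h_j(s_i)
A = [[h(s) for h in basis] for s in states]
```

For real problems A is never materialized; the point of the factored approach is to work with the restricted-domain functions directly rather than with the exponentially long columns.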
Our Approach
Embed structure into the value function space a priori: project into a structured vector space of factored value functions; efficiently find the closest approximation to the "true" value.
    Ṽ = Σ_k w_k h_k : a linear combination of structured features.
Policy Iteration
Guess V; π = greedy(V); V = value of acting on π.
Value determination:
    V = R + γ P_π V
    (2^n × 1) = (2^n × 1) + (2^n × 2^n)(2^n × 1)
    Value = Reward + Discounted expected value.
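The value-determination step can be sketched directly for a tiny MDP by solving the linear system (I − γ P_π) V = R. The 3-state chain, its transition probabilities, and its rewards below are invented for illustration:

```python
import numpy as np

# Exact value determination for a fixed policy: V = R + gamma * P V,
# solved as (I - gamma * P) V = R. The 3-state chain is made up.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # P[s, s'] under the fixed policy
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 0.0, 0.0])

V = np.linalg.solve(np.eye(3) - gamma * P, R)
# V satisfies the Bellman equation V = R + gamma * P V
```

This exact solve costs O((2^n)^3) for a factored MDP with n variables, which is exactly the step the paper replaces with a projection.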
Approximate Policy Iteration
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
Approximate value determination:
    A w ≈ R + γ P_π A w.
Approximate Value Determination
Need a projection of the value function into the space of the basis functions (an L_d projection):
    w = argmin_w ‖ A w − (R_π + γ P_π A w) ‖_d.
Previous work uses L2 and weighted-L2 projections [Koller & Parr '99, '00].
Define the worst-case projection error over iterations:
    β_P = max_{τ = 1…t} ‖ A w_τ − (R_{π_τ} + γ P_{π_τ} A w_τ) ‖_∞.
Analysis of Approx. PI
Theorem:
    ‖ A w_t − V* ‖_∞ ≤ γ^t ‖ A w_0 − V* ‖_∞ + 2 γ β_P / (1 − γ)^2.
We should be doing projections in max-norm!
    w = argmin_w ‖ (A − γ P_π A) w − R ‖_∞.
Approximate PI: Revisited
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
Approximate value determination: A w ≈ R + γ P_π A w.
Analysis motivates projections in max-norm;
Efficient algorithm for max-norm projection.
Efficient Max-norm Projection
Outline: computing the max-norm for fixed weights; cost networks; efficient max-norm projection.
Rewrite the projection
    w = argmin_w ‖ (A − γ P_π A) w − R ‖_∞
as
    w = argmin_w ‖ H w − b ‖_∞,  with H = A − γ P_π A and b = R.
Max over Large State Spaces
For fixed weights w, compute the max-norm:
    ‖ H w − b ‖_∞ = max_s | Σ_i w_i h_i(s) − b(s) |.
However, if the basis and target are functions of only a few variables each, we can do it efficiently!
Cost networks can maximize over large state spaces efficiently when the function is factored:
    f = max_{X_1, …, X_n} Σ_i C_i,  where each factor C_i depends only on a subset of {X_1, …, X_n}.
Cost Networks
Can use variable elimination to maximize over the state space [Bertele & Brioschi '72]:
    max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
    = max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D ( f3(C,D) + f4(B,D) ) ]
    = max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ].
(Figure: cost network over variables A, B, C, D with factors f1, f2, f3, f4.)
As in Bayes nets, maximization is exponential in the size of the largest factor.
Here we need only 16, instead of 64, sum operations.
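The elimination above can be checked numerically. A minimal sketch with arbitrary illustrative factor tables over binary A, B, C, D, eliminating D first exactly as on the slide:

```python
import itertools

# Factor tables over binary variables; the numbers are invented.
f1 = {(a, b): 2 * a - b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): a * c + 1 for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): 3 * c * d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): b + d     for b in (0, 1) for d in (0, 1)}

# Eliminate D: g1(B, C) = max_D [ f3(C, D) + f4(B, D) ]
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}

# Maximize the remaining factored function over A, B, C
elim_max = max(f1[a, b] + f2[a, c] + g1[b, c]
               for a in (0, 1) for b in (0, 1) for c in (0, 1))

# Brute force over all 2^4 joint assignments must agree
brute_max = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
                for a, b, c, d in itertools.product((0, 1), repeat=4))
assert elim_max == brute_max
```

The elimination variant touches only the small factors, which is why the cost scales with the largest factor size rather than with 2^n.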
Max-norm Projection
Algorithm for finding w* ∈ argmin_w ‖ H w − b ‖_∞. Solve by linear programming [Cheney '82]:
    Variables: w_1, …, w_k, φ;
    Minimize: φ;
    Subject to: φ ≥ max_s ( Σ_{i=1..k} w_i h_i(s) − b(s) )
                and φ ≥ max_s ( b(s) − Σ_{i=1..k} w_i h_i(s) ).
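The projection LP can be written down directly for a tiny instance. A sketch using scipy.optimize.linprog; the instance, with one basis function and two states, is made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Max-norm projection as an LP: minimize phi s.t. |(Hw - b)_s| <= phi.
# Tiny invented instance: one basis function (k = 1), two states.
H = np.array([[1.0],
              [1.0]])
b = np.array([0.0, 2.0])
n_s, k = H.shape

c = np.zeros(k + 1)
c[-1] = 1.0                        # objective: minimize phi
# Encode Hw - phi <= b  and  -Hw - phi <= -b
A_ub = np.block([[ H, -np.ones((n_s, 1))],
                 [-H, -np.ones((n_s, 1))]])
b_ub = np.concatenate([b, -b])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (k + 1))
w, phi = res.x[:k], res.x[-1]      # optimum balances the two residuals
```

Note that linprog defaults to nonnegative variables, so free bounds must be passed explicitly; here the optimum is w = 1, phi = 1, splitting the residual evenly between the two states.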
Representing the Constraints
Explicit representation is exponential (|S| = 2^n):
    φ ≥ Σ_{i=1..k} w_i h_i(s) − b(s),  for every state s ∈ S.
If basis and target are factored, cost networks can represent the constraints compactly. For example,
    φ ≥ max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
is replaced, after eliminating D, by
    φ ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ],
    g1(B,C) ≥ f3(C,D) + f4(B,D)  for every assignment to B, C, D,
where the g1(B,C) are new LP variables.
Approximate Policy Iteration
Guess w_0; π_t = greedy(A w_t); A w_{t+1} ≈ value of acting on π_t.
How do we represent the policy? How do we update it efficiently?
Policy improvement.
What about the Policy?
Contextual action model: each action's DBN differs from a default model only in a few contexts.
(Figure: DBNs over X, Y, Z at time t and X', Y', Z' at time t+1 for the default model, Action 1, and Action 2.)
Factored value functions and model give a compact policy description. The policy forms a decision list:
    if context over x, y, z holds then action 1; else if context over x holds then action 2; else if … then action 1.
Theorem [Koller & Parr '00]: the greedy policy relative to a factored value function can be represented as a decision list.
Factored Policy Iteration: Summary
Guess V; π = greedy(V); V = value of acting on π.
Structure induces a decision-list policy.
Key operations are isomorphic to Bayesian network inference.
Time per iteration reduced from O((2^n)^3) to O(poly(k, n, C)):
    C = largest factor in the cost network (a function of structure);
    k = number of basis functions (k << 2^n);
    poly = complexity of the LP solver, in practice close to linear.
Network Management Problem
Computers connected in a network;
Each computer can fail with some probability;
If a computer fails, it increases the probability its neighbors will fail;
At every time step, the sys-admin must decide which computer to fix.
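The failure dynamics can be sketched for a ring topology; the base failure probability and the per-failed-neighbor penalty below are assumed illustrative values, not taken from the paper:

```python
# Minimal sketch of the sysadmin problem's dynamics on a ring.
# BASE_FAIL and NEIGHBOR_PENALTY are assumed values for illustration.
BASE_FAIL = 0.05          # failure probability with all neighbors working
NEIGHBOR_PENALTY = 0.3    # extra failure probability per failed neighbor

def fail_prob(working, i):
    """Probability that machine i is down at the next time step."""
    n = len(working)
    if not working[i]:
        return 1.0  # a failed machine stays down until the sysadmin fixes it
    left = working[(i - 1) % n]
    right = working[(i + 1) % n]
    failed_neighbors = (not left) + (not right)
    return min(1.0, BASE_FAIL + NEIGHBOR_PENALTY * failed_neighbors)

all_up = [True, True, True, True]
one_down = [True, False, True, True]
```

Each machine's next-step status depends only on itself and its ring neighbors, which is exactly the restricted-domain structure the factored transition model and basis functions exploit.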
(Figure: example network topologies: bidirectional ring; ring and star; star; 3 legs; ring of rings; some with a central server.)
Comparing Projections in L2 and L∞
Max-norm projection is also much more efficient: a single cost network rather than many BN inferences; use of a very efficient LP package (CPLEX).
(Chart: relative error vs. number of variables, 3 to 10; series: L2 single basis, L∞ single basis, L∞ pair basis, L2 pair basis.)
Results on Larger Problems: Running Time
(Chart: total time in minutes vs. number of states, 10^0 to 10^14, for the Ring, 3 Legs, and Star topologies.)
Runs in time O(n^3), not O((2^n)^3).
Results on Larger Problems: Error Bounds
(Chart: Bellman error / Rmax vs. number of states, 10^0 to 10^14, for the Ring, 3 Legs, and Star topologies.)
Error remains bounded.
Conclusions
Max-norm projection directly minimizes the error bounds;
Closed-form projection operation provides an exponential complexity reduction;
Exploit structure to reduce computation costs! Solve very large MDPs efficiently.
Future Work
POMDPs (IJCAI '01 workshop paper);
Additional structure: factored actions; relational representations; context-specific independence (CSI);
Multi-agent systems;
Linear programming solution for MDPs.